The task is to predict the score a reviewer gave a hotel, based on the reviewer's text and the other features.
Dataset field descriptions
%pip install category_encoders nltk comet_ml sweetviz
!pip freeze > requirements.txt
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#plt.style.use('ggplot')
plt.style.use('default')
plt.rcParams['figure.figsize'] = (6,4)
from IPython.display import display
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor # building and training the model
from sklearn import metrics # model quality metrics
import category_encoders as ce
# statistical tests
from sklearn.feature_selection import f_classif # ANOVA
from sklearn.feature_selection import chi2 # chi-squared
import comet_ml as comet
import sweetviz
# ===============================================
# os.environ['COMET_AUTO_LOG_DISABLE']='False'
# ===============================================
# print(comet.__version__)
# print(sweetviz.__version__)
# Not initialized, so not used
COMET_LOG_ENABLE=False
# ===============================================
#WORKSPACE = 'dhegl'
WORKSPACE='dheglsfds'
PROJECT = 'sfds-project-3'
# For automatic reports
comet.init(workspace=WORKSPACE, project_name=PROJECT)
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
RANDOM_SEED = 42
COMET INFO: Comet API key is valid
## We have BLAS here, what about you?
#np.show_config()
file_id = '11bOevjZ9P69ZJlB9uLPSOzkk-uge7y8W'
google_drive_url='https://drive.google.com/uc?export=download&confirm=no_antivirus&id='
remote_url = google_drive_url + file_id
local_file = '../input/sf-booking/hotels_train.zip'
if os.path.exists('../input/sf-booking/') and os.path.exists(local_file):
    print('Load local:', local_file)
    url = local_file
else:
    print('Load remote:', remote_url)
    url = remote_url
Load remote: https://drive.google.com/uc?export=download&confirm=no_antivirus&id=11bOevjZ9P69ZJlB9uLPSOzkk-uge7y8W
hotels_df = pd.read_csv(url, compression='zip')
hotels_df.head(2)
| | hotel_address | additional_number_of_scoring | review_date | average_score | hotel_name | reviewer_nationality | negative_review | review_total_negative_word_counts | total_number_of_reviews | positive_review | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | reviewer_score | tags | days_since_review | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Stratton Street Mayfair Westminster Borough Lo... | 581 | 2/19/2016 | 8.4 | The May Fair Hotel | United Kingdom | Leaving | 3 | 1994 | Staff were amazing | 4 | 7 | 10.0 | [' Leisure trip ', ' Couple ', ' Studio Suite ... | 531 day | 51.507894 | -0.143671 |
| 1 | 130 134 Southampton Row Camden London WC1B 5AF... | 299 | 1/12/2017 | 8.3 | Mercure London Bloomsbury Hotel | United Kingdom | poor breakfast | 3 | 1361 | location | 2 | 14 | 6.3 | [' Business trip ', ' Couple ', ' Standard Dou... | 203 day | 51.521009 | -0.123097 |
hotels_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 386803 entries, 0 to 386802
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype
---  ------                                      --------------   -----
 0   hotel_address                               386803 non-null  object
 1   additional_number_of_scoring                386803 non-null  int64
 2   review_date                                 386803 non-null  object
 3   average_score                               386803 non-null  float64
 4   hotel_name                                  386803 non-null  object
 5   reviewer_nationality                        386803 non-null  object
 6   negative_review                             386803 non-null  object
 7   review_total_negative_word_counts           386803 non-null  int64
 8   total_number_of_reviews                     386803 non-null  int64
 9   positive_review                             386803 non-null  object
 10  review_total_positive_word_counts           386803 non-null  int64
 11  total_number_of_reviews_reviewer_has_given  386803 non-null  int64
 12  reviewer_score                              386803 non-null  float64
 13  tags                                        386803 non-null  object
 14  days_since_review                           386803 non-null  object
 15  lat                                         384355 non-null  float64
 16  lng                                         384355 non-null  float64
dtypes: float64(4), int64(5), object(8)
memory usage: 50.2+ MB
hotels_df.duplicated().sum()
307
# in the "production" version we will not do this
hotels_df.drop_duplicates(inplace=True)
hotels_df.describe(include='object')
| | hotel_address | review_date | hotel_name | reviewer_nationality | negative_review | positive_review | tags | days_since_review |
|---|---|---|---|---|---|---|---|---|
| count | 386496 | 386496 | 386496 | 386496 | 386496 | 386496 | 386496 | 386496 |
| unique | 1493 | 731 | 1492 | 225 | 248828 | 311737 | 47135 | 731 |
| top | 163 Marsh Wall Docklands Tower Hamlets London ... | 8/2/2017 | Britannia International Hotel Canary Wharf | United Kingdom | No Negative | No Positive | [' Leisure trip ', ' Couple ', ' Double Room '... | 1 days |
| freq | 3587 | 1910 | 3587 | 183952 | 95833 | 26863 | 3853 | 1910 |
hotels_df.describe().round(2)
| | additional_number_of_scoring | average_score | review_total_negative_word_counts | total_number_of_reviews | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | reviewer_score | lat | lng |
|---|---|---|---|---|---|---|---|---|---|
| count | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 384048.00 | 384048.00 |
| mean | 498.50 | 8.40 | 18.54 | 2744.68 | 17.78 | 7.18 | 8.40 | 49.44 | 2.82 |
| std | 500.37 | 0.55 | 29.70 | 2316.93 | 21.72 | 11.05 | 1.64 | 3.47 | 4.58 |
| min | 1.00 | 5.20 | 0.00 | 43.00 | 0.00 | 1.00 | 2.50 | 41.33 | -0.37 |
| 25% | 169.00 | 8.10 | 2.00 | 1161.00 | 5.00 | 1.00 | 7.50 | 48.21 | -0.14 |
| 50% | 342.00 | 8.40 | 9.00 | 2134.00 | 11.00 | 3.00 | 8.80 | 51.50 | -0.00 |
| 75% | 660.00 | 8.80 | 23.00 | 3633.00 | 22.00 | 8.00 | 9.60 | 51.52 | 4.83 |
| max | 2682.00 | 9.80 | 408.00 | 16670.00 | 395.00 | 355.00 | 10.00 | 52.40 | 16.43 |
# Convert some descriptive features to the categorical dtype
# to speed up processing and reduce the memory used by the dataset
category_columns = ['hotel_address', 'hotel_name', 'reviewer_nationality']
hotels_df[category_columns] = hotels_df[category_columns].astype('category')
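To see why the conversion helps, here is a minimal sketch on a synthetic column. The names `names` and `toy` are illustrative and not part of the notebook; it compares the deep memory footprint before and after `astype('category')`.

```python
import pandas as pd

# Synthetic stand-in for a repetitive text column such as hotel_name
names = ['The May Fair Hotel', 'Mercure London Bloomsbury Hotel'] * 5000
toy = pd.DataFrame({'hotel_name': names})

# deep=True counts the bytes of the Python strings, not just the pointers
as_object = toy['hotel_name'].memory_usage(deep=True)
as_category = toy['hotel_name'].astype('category').memory_usage(deep=True)

print(f'object:   {as_object} bytes')
print(f'category: {as_category} bytes')
```

The gain shrinks as the share of unique values grows, so a near-unique column such as the review text would benefit little from the same conversion.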
hotels_df.isna().sum()
hotel_address                                    0
additional_number_of_scoring                     0
review_date                                      0
average_score                                    0
hotel_name                                       0
reviewer_nationality                             0
negative_review                                  0
review_total_negative_word_counts                0
total_number_of_reviews                          0
positive_review                                  0
review_total_positive_word_counts                0
total_number_of_reviews_reviewer_has_given       0
reviewer_score                                   0
tags                                             0
days_since_review                                0
lat                                           2448
lng                                           2448
dtype: int64
# Restore the "lost" hotel locations using the mean coordinates of the hotel's country. Everything revolves around the capitals anyway.
# Precision matters little here, since the share of hotels without coordinates is very small.
# Extract the country from the hotel address and fill the gaps with the country's mean latitude and longitude
hotels_df['country'] = hotels_df['hotel_address'].str.extract(r'.+ (.+)$')\
.astype('category').replace('Kingdom', 'United Kingdom')
lat_mean = hotels_df.groupby('country')['lat'].mean()
lat_mean
country Austria 48.203367 France 48.863806 Italy 45.479617 Netherlands 52.362211 Spain 41.389125 United Kingdom 51.510737 Name: lat, dtype: float64
lng_mean = hotels_df.groupby('country')['lng'].mean()
lng_mean
country Austria 16.367176 France 2.326842 Italy 9.191845 Netherlands 4.885346 Spain 2.169152 United Kingdom -0.139075 Name: lng, dtype: float64
# Option 1 (explicit)
def set_lat(x: pd.Series):
    return lat_mean[x['country']]
#--------------------------------
hotels_df.loc[hotels_df['lat'].isna(),'lat'] = hotels_df[hotels_df['lat'].isna()].apply(set_lat, axis=1)
#hotels_df['lat'].isna().sum()
# Option 2 (lambda)
hotels_df.loc[hotels_df['lng'].isna(),'lng'] = \
hotels_df[hotels_df['lng'].isna()].apply(lambda x: lng_mean[x['country']], axis=1)
#hotels_df['lng'].isna().sum()
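Both options above fill the gaps row by row. A possible vectorized alternative is sketched below on a toy frame, using `groupby(...).transform('mean')` together with `fillna`. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    'country': ['Austria', 'Austria', 'Austria', 'Spain', 'Spain'],
    'lat': [48.2, 48.4, np.nan, 41.4, np.nan],
    'lng': [16.3, 16.5, np.nan, 2.2, np.nan],
})

# transform('mean') returns a column aligned with the original rows that
# holds each row's country mean (NaN values are skipped by mean)
for col in ['lat', 'lng']:
    country_mean = toy.groupby('country')[col].transform('mean')
    toy[col] = toy[col].fillna(country_mean)

print(toy)
```

Only the missing cells are touched, so known coordinates stay exactly as they were.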
#plt.scatter(hotels_df['lat'], hotels_df['lng']);
fig, ax = plt.subplots(figsize=(6,5))
sns.scatterplot(hotels_df.sample(30000), x='lng', y='lat', hue='country')
# even without geopandas it is clear this is somewhere in Europe.
plt.title('Hotel locations by coordinates');
# or the spread of coordinates, whichever you prefer
The target variable we need to model
print('Rev`s score min:', hotels_df.reviewer_score.min())
print('Rev`s score max:', hotels_df.reviewer_score.max())
print('Rev`s score mean:', hotels_df.reviewer_score.mean().round(1))
print('Rev`s score median:', hotels_df.reviewer_score.median())
Rev`s score min: 2.5 Rev`s score max: 10.0 Rev`s score mean: 8.4 Rev`s score median: 8.8
One of the important baseline features.
print('Hotel avg score min:', hotels_df.average_score.min())
print('Hotel avg score max', hotels_df.average_score.max())
print('Hotel score mean:', hotels_df.average_score.mean().round(1))
print('Hotel score median:', hotels_df.average_score.median())
Hotel avg score min: 5.2 Hotel avg score max 9.8 Hotel score mean: 8.4 Hotel score median: 8.4
# Distribution of mean scores: the last-year average and the average over the whole sample period
fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(12,4) )
sns.histplot(hotels_df.groupby('hotel_name')['average_score'].mean(), ax=ax[0], bins=22);
ax[0].set_title('Hotels avg score');
sns.histplot(hotels_df.groupby('hotel_name')['reviewer_score'].mean(), ax=ax[1], bins=23);
ax[1].set_title('Reviewers mean score');
fig,ax = plt.subplots(figsize=(12,2))
sns.boxplot(hotels_df[['average_score','reviewer_score']], orient='h');
plt.title("Score distribution");
Comparing these two measures directly is not quite fair, because average_score is already an averaged score. So we compute the reviewers' mean score per hotel over the whole sample period.
by_mean_score = hotels_df.groupby('hotel_name')['reviewer_score'].mean().round(1)
hotels_df['reviever_mean_score'] = hotels_df.hotel_name.apply(lambda n: by_mean_score.loc[n])
fig,ax = plt.subplots(figsize=(12,2))
sns.boxplot(hotels_df[['average_score', 'reviever_mean_score', 'reviewer_score']], orient='h');
plt.title("Score distribution");
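The per-hotel aggregate can also be attached without a row-wise `apply`. Below is a minimal sketch on toy data, assuming the same `hotel_name` and `reviewer_score` columns. The output column names `hotel_mean_score` and `hotel_median_score` are illustrative.

```python
import pandas as pd

# Toy reviews for two hypothetical hotels
toy = pd.DataFrame({
    'hotel_name': ['A', 'A', 'A', 'B', 'B'],
    'reviewer_score': [10.0, 9.0, 5.0, 7.0, 8.0],
})

# transform() broadcasts each group's statistic back to that group's rows
by_hotel = toy.groupby('hotel_name')['reviewer_score']
toy['hotel_mean_score'] = by_hotel.transform('mean').round(1)
toy['hotel_median_score'] = by_hotel.transform('median')

print(toy)
```

For hotel A the single low score of 5.0 pulls the mean down to 8.0 while the median stays at 9.0, which is the usual argument for also looking at the median.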
The same, but with the median (to account for outliers)
by_mean_score = hotels_df.groupby('hotel_name')['reviewer_score'].median()
hotels_df['reviever_median_score'] = hotels_df.hotel_name.apply(lambda n: by_mean_score.loc[n])
fig,ax = plt.subplots(figsize=(12,2))
sns.boxplot(hotels_df[['average_score', 'reviever_median_score', 'reviewer_score']], orient='h');
plt.title("Score distribution");
Now it is noticeable that the median of the per-hotel guest scores is somewhat higher than the average_score given in the dataset, which is averaged over the last year. But since we are interested in the average score of a particular hotel, testing the statistical significance of this difference has little value overall. We will simply keep it in mind when modeling.
Both features are closer to an ordinal categorical type, represented by a two-digit number.
print('Number of unique values')
print(' reviewer_score:', hotels_df.reviewer_score.nunique())
print(' average_score:', hotels_df.average_score.nunique())
print(' reviever_mean_score:', hotels_df.reviever_mean_score.nunique())
print(' reviever_median_score:', hotels_df.reviever_median_score.nunique())
Number of unique values reviewer_score: 37 average_score: 34 reviever_mean_score: 39 reviever_median_score: 31
The mean widens the range of scores, while the median narrows it.
review_date and days_since_review are interchangeable features.
hotels_df['review_date'] = pd.to_datetime(hotels_df['review_date'],format= "%m/%d/%Y")
hotels_df['days_since_review'] = hotels_df['days_since_review'].str.extract(r'(\d+) day')
hotels_df['days_since_review'] = pd.to_numeric(hotels_df['days_since_review'], downcast='integer')
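One way to check that the two columns carry the same information is to add the day count back to the review date: if they are interchangeable, every row should land on the same reference date. The sketch below uses the two rows shown in the `head(2)` output above. The names `toy` and `reference_date` are illustrative.

```python
import pandas as pd

# The two rows shown in the head(2) output of the raw dataset
toy = pd.DataFrame({
    'review_date': ['2/19/2016', '1/12/2017'],
    'days_since_review': ['531 day', '203 day'],
})

toy['review_date'] = pd.to_datetime(toy['review_date'], format='%m/%d/%Y')
toy['days_since_review'] = pd.to_numeric(
    toy['days_since_review'].str.extract(r'(\d+) day', expand=False))

# If the columns are interchangeable, this sum is one constant date
reference_date = toy['review_date'] + pd.to_timedelta(toy['days_since_review'], unit='D')
print(reference_date.unique())
```

Both sample rows resolve to the same date, so one of the two columns is redundant for modeling purposes.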
days_since_review
The distribution is roughly uniform, so we can take into account the effect of "seasonality" on the score left by the guest.
hotels_df['days_since_review'].hist();
plt.title("Number of reviews left per day");
# A rough "seasonality", but broken down by month
# Nothing interesting overall, +/- ~7k reviews, uniform
hotels_df['month'] = hotels_df['review_date'].dt.month.astype(int)
hotels_df['month'].hist(bins=12);
by_month = hotels_df.groupby('month')['reviewer_score'].mean()
ax=sns.barplot(by_month.reset_index(), x='month', y='reviewer_score');
ax.set_ylim(by_month.min()-0.01, by_month.max()+0.01)
ax.set_title("Seasonal rating fluctuations by month (y-axis from 8+)");
Another important feature, since we are dealing with the cultural particularities of guests.
# Number of guests by nationality. Top seven
print('Rev-s nationality count:', hotels_df['reviewer_nationality'].nunique())
hotels_df['reviewer_nationality'].value_counts().head(7)
# Guests from the United Kingdom clearly dominate (by several times)
Rev-s nationality count: 225
United Kingdom 183952 United States of America 26494 Australia 16216 Ireland 11119 United Arab Emirates 7612 Saudi Arabia 6716 Netherlands 6598 Name: reviewer_nationality, dtype: int64
Since using this feature through binary encoding did not improve things much, we take another route: single out the leaders by high and by low scores given to hotels (P.S. this gained -0.0001 in MAPE :D)
# from the top 50 by review count, pick the top 10 with the highest mean rating and the top 10 with the lowest
by_nationality = hotels_df.groupby('reviewer_nationality')[['reviewer_score']].agg(['count', 'mean']).round(1)
display(by_nationality.sort_values(by=('reviewer_score', 'count'), ascending=False).head(50)\
.sort_values(by=('reviewer_score', 'mean'), ascending=False).head(10))
top_10_nation = by_nationality.sort_values(by=('reviewer_score', 'count'), ascending=False).head(50)\
.sort_values(by=('reviewer_score', 'mean'), ascending=False).head(10).index.to_list()
| reviewer_nationality | reviewer_score count | reviewer_score mean |
|---|---|---|
| United States of America | 26494 | 8.8 |
| Israel | 4912 | 8.7 |
| Canada | 5977 | 8.6 |
| Australia | 16216 | 8.6 |
| New Zealand | 2443 | 8.6 |
| United Kingdom | 183952 | 8.5 |
| Hungary | 1666 | 8.5 |
| China | 2562 | 8.5 |
| Malta | 1250 | 8.5 |
| Ireland | 11119 | 8.5 |
display(by_nationality.sort_values(by=('reviewer_score', 'count'), ascending=False).head(50)\
.sort_values(by=('reviewer_score', 'mean'), ascending=True).head(10))
low_10_nation = by_nationality.sort_values(by=('reviewer_score', 'count'), ascending=False).head(50)\
.sort_values(by=('reviewer_score', 'mean'), ascending=True).head(10).index.to_list()
| reviewer_nationality | reviewer_score count | reviewer_score mean |
|---|---|---|
| United Arab Emirates | 7612 | 7.9 |
| Saudi Arabia | 6716 | 7.9 |
| Qatar | 2044 | 7.9 |
| Oman | 1019 | 7.9 |
| India | 2556 | 7.9 |
| Bahrain | 1179 | 8.0 |
| Portugal | 1379 | 8.0 |
| Lebanon | 1697 | 8.0 |
| Kuwait | 3700 | 8.0 |
| Turkey | 4102 | 8.0 |
# binary feature top_10_nation, based on the mean rating
hotels_df['top_10_nation'] = hotels_df['reviewer_nationality'].apply(lambda x: 1 if x in top_10_nation else 0)
hotels_df[hotels_df['top_10_nation']==1][['reviewer_nationality']].sample(7)
| | reviewer_nationality |
|---|---|
| 60514 | Israel |
| 212208 | Canada |
| 341458 | United Kingdom |
| 312369 | United Kingdom |
| 324829 | United Kingdom |
| 3992 | United Kingdom |
| 285360 | Australia |
# binary feature low_10_nation, based on the mean rating
hotels_df['low_10_nation'] = hotels_df['reviewer_nationality'].apply(lambda x: 1 if x in low_10_nation else 0)
hotels_df[hotels_df['low_10_nation']==1][['reviewer_nationality']].sample(7)
| | reviewer_nationality |
|---|---|
| 318063 | Saudi Arabia |
| 204390 | Qatar |
| 64613 | Saudi Arabia |
| 262926 | Portugal |
| 215421 | Saudi Arabia |
| 49407 | Saudi Arabia |
| 208737 | Turkey |
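The selection of high-scoring and low-scoring nationalities and the two flags can be written more compactly. The sketch below runs on toy data. The helper `nation_groups` and its parameters `n_frequent` and `n_pick` are hypothetical names, and unlike the cells above it ranks by the unrounded mean, so ties may resolve differently.

```python
import pandas as pd

# Toy data: four hypothetical nationalities with different review counts
toy = pd.DataFrame({
    'reviewer_nationality': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
    'reviewer_score': [9.0, 10.0, 9.5, 6.0, 7.0, 8.0, 8.5, 2.5],
})

def nation_groups(df, n_frequent, n_pick):
    """Among the n_frequent most frequent nationalities, return the
    n_pick with the highest and the n_pick with the lowest mean score."""
    stats = df.groupby('reviewer_nationality')['reviewer_score'].agg(['count', 'mean'])
    frequent = stats.nlargest(n_frequent, 'count')
    top = frequent.nlargest(n_pick, 'mean').index.to_list()
    low = frequent.nsmallest(n_pick, 'mean').index.to_list()
    return top, low

top, low = nation_groups(toy, n_frequent=3, n_pick=1)

# isin() builds the binary flags without a Python-level lambda per row
toy['top_nation'] = toy['reviewer_nationality'].isin(top).astype(int)
toy['low_nation'] = toy['reviewer_nationality'].isin(low).astype(int)
print(top, low)
```

Nationality D has the lowest score but only one review, so the frequency filter keeps it out of the low group, which mirrors the purpose of restricting the ranking to the 50 most frequent nationalities.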
# We already extracted this feature in the data-cleaning block to get the country-average coordinates
hotels_df['country'].value_counts()
United Kingdom 196773 Spain 45132 France 44528 Netherlands 43004 Austria 29177 Italy 27882 Name: country, dtype: int64
ax = sns.countplot(hotels_df, x='country', order=hotels_df['country'].value_counts().index)
ax.set_title('Number of reviews by hotel country');
# On the island everything is meticulous: you do not check out of a hotel without leaving a review
by_country = hotels_df.groupby('country')['reviewer_score'].mean()
ax=sns.barplot(by_country.reset_index(), x='country', y='reviewer_score');
ax.set_ylim(by_country.min()-0.01, by_country.max()+0.01)
ax.set_title("Mean score by hotel country (y-axis from 8+)");
# "Purpose of stay" or something like that. Split into business and leisure (wandering around, rest, shopping)
hotels_df['purpose_arrival'] = hotels_df['tags'].str.extract(r'([LB].+) trip') # .?\'?
hotels_df['purpose_arrival'].value_counts()
Leisure 313353 Business 61934 Name: purpose_arrival, dtype: int64
ax=sns.countplot(hotels_df, x='purpose_arrival');
ax.set_title('Number of reviews by trip type (purpose of stay)');
# Keep only the "business trip" flag as a separate feature; everything else goes into "other".
hotels_df['is_business'] = hotels_df['purpose_arrival'].apply(lambda x: 1 if x=='Business' else 0)
# Looks like the declared number of nights stayed. Not necessarily the actual number, but that was the plan.
# Treat it as numeric. Who knows, it may affect how subjective the score is.
hotels_df['stayed_nights'] = hotels_df['tags'].str.extract(r'Stayed (\d+) night').fillna(0).astype(int)
#hotels_df['stayed_nights'].value_counts().head(5)
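To make the tag parsing easier to follow, here is a minimal sketch on three made-up tag strings shaped like the `tags` column. It uses an explicit `(Leisure|Business)` alternation instead of the `[LB].+` pattern above, and `pd.to_numeric` before `fillna`; both are substitutions for illustration, not the notebook's own code.

```python
import pandas as pd

# Tag strings shaped like the `tags` column; the exact values are made up
tags = pd.Series([
    "[' Leisure trip ', ' Couple ', ' Studio Suite ', ' Stayed 2 nights ']",
    "[' Business trip ', ' Solo traveler ', ' Standard Double Room ', ' Stayed 1 night ']",
    "[' Group ', ' Twin Room ']",
])

nights = tags.str.extract(r'Stayed (\d+) night', expand=False)

parsed = pd.DataFrame({
    # NaN when the tag list names no trip type
    'purpose': tags.str.extract(r'(Leisure|Business) trip', expand=False),
    # a missing "Stayed N nights" tag becomes 0, as in the cell above
    'stayed_nights': pd.to_numeric(nights).fillna(0).astype(int),
    'is_business': tags.str.contains('Business trip').astype(int),
})
print(parsed)
```

The third row shows the two fallbacks: no trip type yields a missing `purpose`, and no stay length yields `stayed_nights` equal to 0.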
sns.histplot(hotels_df, x='stayed_nights', bins=30);
plt.title("Distribution of review counts\nby nights stayed");
fig, ax = plt.subplots(figsize=(8,2))
sns.boxplot(hotels_df, x='stayed_nights', ax=ax);
plt.title("Distribution of review counts\nby nights stayed");
by_stayed_nights = hotels_df.groupby('stayed_nights')['reviewer_score'].mean()
fig, ax = plt.subplots(figsize=(8,4))
ax=sns.barplot(by_stayed_nights.reset_index().\
sort_values(by='stayed_nights'), x='stayed_nights', y='reviewer_score');
ax.set_ylim(by_stayed_nights.min()-.01, by_stayed_nights.max()+.01)
ax.set_title("Mean rating by length of stay.");
Since values beyond 10-15 nights are likely outliers, the only notable point is the low rating at zero. A decline in rating as the stay gets longer is also noticeable.
Number of guests per review: solo, couple, family, group.
hotels_df['is_solo_travel'] = hotels_df['tags'].str.contains(r'Solo traveler')
ax = sns.countplot(hotels_df, x='is_solo_travel');
ax.set_title('Solo vs other');
hotels_df['is_couple_travel'] = hotels_df['tags'].str.contains(r'Couple')
ax=sns.countplot(hotels_df, x='is_couple_travel');
ax.set_title('Couple vs other');
hotels_df['is_family_travel'] = hotels_df['tags'].str.contains(r'Family')
ax=sns.countplot(hotels_df, x='is_family_travel');
ax.set_title('Family vs other');
hotels_df['is_group_travel'] = hotels_df['tags'].str.contains(r'Group')
ax=sns.countplot(hotels_df, x='is_group_travel');
ax.set_title('Group vs other');
hotels_df[['negative_review']].value_counts().head(10)
# "No Negative" is fine, but "Breakfast"... is that bad? It is much worse when there is none
negative_review
No Negative 95833
Nothing 10733
Nothing 3152
nothing 1658
N A 802
None 737
606
N a 384
Breakfast 296
Small room 283
dtype: int64
hotels_df[['positive_review']].value_counts().head(10)
positive_review No Positive 26863 Location 6824 Everything 1697 location 1248 Nothing 930 The location 828 Great location 807 Good location 690 Location 663 Breakfast 455 dtype: int64
These are what shape the guest's objective and subjective response.
sns.histplot(hotels_df, x='review_total_negative_word_counts', bins=50);
plt.title("Distribution of the number\nof negative words per review");
sns.histplot(hotels_df, x='review_total_positive_word_counts', bins=50);
plt.title("Distribution of the number\nof positive words per review");
score_period = (hotels_df['review_date'].max() - hotels_df['review_date'].min()).days
print('Time span of the data (days):', score_period)
Time span of the data (days): 730
Since the distribution of positive and negative scores in the reviews does not quite match reality, we build a pair of features that reflect these scores more realistically.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.downloader.download('vader_lexicon')
# nltk.downloader.download('stopwords')
# from nltk.corpus import stopwords
# Runs very slowly on Colab and Kaggle. If you need to tune other
# parameters (or debug), this step can be skipped, leaving the constants below.
hotels_df[['neg_ratio', 'pos_ratio']] = [.5, .5]
[nltk_data] Downloading package vader_lexicon to [nltk_data] C:\Users\Admin\AppData\Roaming\nltk_data... [nltk_data] Package vader_lexicon is already up-to-date!
# Take the most straightforward route
# give "No Negative" a positive connotation
hotels_df['negative_review_'] = hotels_df['negative_review'].str.replace("No Negative", "Positive")
# give "No Positive" a negative connotation
hotels_df['positive_review_'] = hotels_df['positive_review'].str.replace("No Positive", "Negative")
hotels_df[['negative_review_']].value_counts().head(7)
negative_review_
Positive 95833
Nothing 10733
Nothing 3152
nothing 1658
N A 802
None 737
606
dtype: int64
hotels_df[['positive_review_']].value_counts().head(7)
positive_review_ Negative 26863 Location 6824 Everything 1697 location 1248 Nothing 930 The location 828 Great location 807 dtype: int64
%%time
sent_analyzer = SentimentIntensityAnalyzer()
def _analyzer(review):
    result = sent_analyzer.polarity_scores(review)
    return pd.Series([result['neg'], result['pos']])
# concatenate the two reviews and compute the scores
hotels_df[['neg_ratio', 'pos_ratio']] = (hotels_df['negative_review_'] + ';' + hotels_df['positive_review_']).apply(_analyzer)
hotels_df.drop(columns=['negative_review_', 'positive_review_'], inplace=True)
CPU times: total: 3min 30s Wall time: 3min 31s
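Part of the runtime above may come from building a `pd.Series` for every row inside `apply`. A possible alternative, sketched here, collects one dict per review and builds the frame once. The scorer `toy_scores` is a hypothetical stand-in for `polarity_scores`, so that the example runs without NLTK; no speed-up was measured on this dataset.

```python
import pandas as pd

def toy_scores(text):
    """Hypothetical stand-in for SentimentIntensityAnalyzer.polarity_scores:
    returns the share of words found in two tiny made-up word lists."""
    words = text.lower().split()
    n = max(len(words), 1)
    return {'neg': sum(w in {'poor', 'dirty'} for w in words) / n,
            'pos': sum(w in {'amazing', 'great'} for w in words) / n}

# Reviews concatenated with ';' in the same way as in the cell above
reviews = pd.Series(['poor breakfast;location', 'Leaving;Staff were amazing'])

# One dict per review, assembled into a frame in a single step
scores = pd.DataFrame([toy_scores(r) for r in reviews], index=reviews.index)
print(scores[['neg', 'pos']])
```

With the real analyzer, `toy_scores` would be replaced by `sent_analyzer.polarity_scores`, and `scores[['neg', 'pos']]` could then be assigned to the `neg_ratio` and `pos_ratio` columns.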
# click through a few samples
hotels_df[['negative_review', 'neg_ratio', 'positive_review', 'pos_ratio']].sample(10)
| | negative_review | neg_ratio | positive_review | pos_ratio |
|---|---|---|---|---|
| 72281 | The glass shower in the middle of the room Th... | 0.139 | No Positive | 0.000 |
| 262051 | I think you need a usher in the cinema 20 min... | 0.000 | The friendliness of the staff and such a beau... | 0.182 |
| 268796 | rooms are quite small also think breakfast co... | 0.134 | well decorated rooms and shower is to die for... | 0.141 |
| 69230 | Room was smaller than photo | 0.000 | Location | 0.000 |
| 136811 | Not enough info on how to make the most of th... | 0.000 | Lovely d cor quality fittings sophisticated t... | 0.388 |
| 136437 | No Negative | 0.000 | The service was excellent It was my parents g... | 0.448 |
| 90593 | No Negative | 0.010 | We really enjoyed this accommodation with its... | 0.243 |
| 90301 | Sauna wasn t on when we went to use it but wa... | 0.000 | Nice friendly staff who are more than happy t... | 0.472 |
| 36578 | No Negative | 0.000 | Room dising | 0.643 |
| 152115 | A greater range of fruit at breakfast would h... | 0.000 | Very clean and tidy hotel with good facilitie... | 0.476 |
# so-so, but let's try it as a separate experiment. (Spoiler: I turned out to be wrong ;) )
by_target = hotels_df.groupby('reviewer_score')[['neg_ratio', 'pos_ratio']].mean()
fig, ax = plt.subplots(figsize=(6,5))
sns.lineplot(by_target)
plt.title('Mean rating vs. the positive and negative\nsentiment ratios in reviews');
As we can see, the target variable depends almost linearly on the combination of positive and negative connotation in the reviews.
%%time
### Reduce the report size by dropping the features already processed at this stage
no_report_columns = ['hotel_address', 'tags', 'positive_review', 'negative_review']
report = sweetviz.analyze(hotels_df.drop(columns=no_report_columns), target_feat='reviewer_score')
CPU times: total: 1min 50s Wall time: 1min 50s
### (!) Better not to do this. Mixed with ML experiments, it breaks the main Comet ML panel with the metric charts
### + HTML logging stopped working for me; possibly related, for reasons unknown.
# ### Save the sweetviz analysis report to the ML tracker.
# exp_version[1] += 1
# experiment = comet.Experiment(**EXPERIMENT_PARAMS)
# experiment.set_name('Analitics')
# experiment.add_tag('EDA {}.{}'.format(exp_version[0], exp_version[1]))
# ### Save to the tracker
# report.log_comet(experiment)
# experiment.end()
# ====================================================
os.environ['COMET_AUTO_LOG_DISABLE']='True'
# ====================================================
report.show_notebook(h=700, w=1100, layout='vertical')
os.environ['COMET_AUTO_LOG_DISABLE']='False'
# ====================================================
COMET WARNING: As you are running in a Jupyter environment, you will need to call `experiment.end()` when finished to ensure all metrics and code are logged before exiting.
Conclusions from the report
# ## Let's take a risk, it is interesting.
# ## (P.S. a dubious idea: just two hotels out of fifteen hundred. Dropped it)
# hotels_name = ['Britannia International Hotel Canary Wharf']
# hotels_df['down_score_leader'] = hotels_df['hotel_name'].apply(lambda x : 1 if x in hotels_name else 0)
# hotels_name = ['Intercontinental London The O2']
# hotels_df['up_score_leader'] = hotels_df['hotel_name'].apply(lambda x : 1 if x in hotels_name else 0)
# Moving on to ML and to selecting, iterating over, and evaluating features
# enable this if you need Comet ML tracking of the workflow stages
## ==========================================================
COMET_LOG_ENABLE=True
## ==========================================================
def delete_project(do=False):
    """Delete the project on comet.ml"""
    if not do:
        return
    api = comet.API()
    try:
        result = api.delete_project(
            workspace=WORKSPACE,
            project_name=PROJECT,
            delete_experiments=True
        )
        print(result)
    except Exception as e:
        print(e)
EXPERIMENT_PARAMS = dict(workspace=WORKSPACE, project_name=PROJECT,
auto_output_logging="default", display_summary_level=1,
log_code=False, log_graph=False, auto_param_logging=False,
log_git_metadata=False, log_git_patch=False,
auto_metric_logging=False, auto_log_co2=False,
log_env_details=False,# log_env_host=False
# log_env_gpu=False, log_env_cpu=False,
)
##delete_project(True)
### Create the comet.ml project
## api_key in /<USERHOMEDIR>/.comet.config
if COMET_LOG_ENABLE:
## Удаляем проект и эксперименты для начала с "чистого листа"
delete_project(True)
api = comet.API()
project = api.create_project(
workspace=WORKSPACE,
project_name=PROJECT,
project_description='SFDS ML Workflow',
public=True,
)
print('project:', project)
#time.sleep(1)
project_notes="""
<a href="https://skillfactory.ru/">
<img src="https://raw.githubusercontent.com/dhegl/sf_ds/64c052f95af5d042844ed56f765c2cbb566d1680/main/static/medium.svg" alt="Онлайн-школа SkillFactory" width="160px" align="right" />
</a>
# **SFDS EDA+TRAINING MODEL LEARNING PROJECT**
- create learning project for model training experiments
- logging all workflow phases
|Begin version|State|
|---|--:|
|0.0.12|3|
"""
    results = api.set_project_notes(
        workspace=WORKSPACE,
        project_name=PROJECT,
        notes=project_notes
    )
    print('Send result', results)
<Response [200]>
project: {'projectId': 'd191d501c69a46b981bd949b6ea6a6f6'}
Send result {'msg': 'Saved', 'code': 200, 'data': None, 'sdk_error_code': 0}
RANDOM_SEED = 4242
TARGET = 'reviewer_score'
# Start tracking the training log by creating an experiment in Comet ML
exp_version = [1,0]
exp_log = {}
# One experiment per feature-selection attempt
# The baseline comes first
hotels_df.head(3)
| | hotel_address | additional_number_of_scoring | review_date | average_score | hotel_name | reviewer_nationality | negative_review | review_total_negative_word_counts | total_number_of_reviews | positive_review | ... | low_10_nation | purpose_arrival | is_business | stayed_nights | is_solo_travel | is_couple_travel | is_family_travel | is_group_travel | neg_ratio | pos_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Stratton Street Mayfair Westminster Borough Lo... | 581 | 2016-02-19 | 8.4 | The May Fair Hotel | United Kingdom | Leaving | 3 | 1994 | Staff were amazing | ... | 0 | Leisure | 0 | 2 | False | True | False | False | 0.000 | 0.559 |
| 1 | 130 134 Southampton Row Camden London WC1B 5AF... | 299 | 2017-01-12 | 8.3 | Mercure London Bloomsbury Hotel | United Kingdom | poor breakfast | 3 | 1361 | location | ... | 0 | Business | 1 | 1 | False | True | False | False | 0.608 | 0.000 |
| 2 | 151 bis Rue de Rennes 6th arr 75006 Paris France | 32 | 2016-10-18 | 8.9 | Legend Saint Germain by Elegancia | China | No kettle in room | 6 | 406 | No Positive | ... | 0 | Leisure | 0 | 3 | True | False | False | False | 0.663 | 0.000 |
3 rows × 32 columns
exp_version[1] += 1
exp_name = 'baseline'
if COMET_LOG_ENABLE:
    experiment = comet.Experiment(**EXPERIMENT_PARAMS)
    experiment.set_name(exp_name)
    experiment.add_tag('learning {}.{}'.format(exp_version[0], exp_version[1]))
COMET WARNING: Comet has disabled auto-logging functionality as it has been imported after the following ML modules: sklearn. Metrics and hyperparameters can still be logged using Experiment.log_metrics() and Experiment.log_parameters()
COMET WARNING: As you are running in a Jupyter environment, you will need to call `experiment.end()` when finished to ensure all metrics and code are logged before exiting.
COMET INFO: Couldn't find a Git repository in 'C:\\Python\\learning\\eda.kaggle\\project-3-kaggle\\code' nor in any parent directory. You can override where Comet is looking for a Git Patch by setting the configuration `COMET_GIT_DIRECTORY`
COMET INFO: Experiment is live on comet.com https://www.comet.com/dheglsfds/sfds-project-3/86462d90bd474f5eaefafb7fec38c14c
hotels_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 386496 entries, 0 to 386802
Data columns (total 32 columns):
 #   Column                                      Non-Null Count   Dtype
---  ------                                      --------------   -----
 0   hotel_address                               386496 non-null  category
 1   additional_number_of_scoring                386496 non-null  int64
 2   review_date                                 386496 non-null  datetime64[ns]
 3   average_score                               386496 non-null  float64
 4   hotel_name                                  386496 non-null  category
 5   reviewer_nationality                        386496 non-null  category
 6   negative_review                             386496 non-null  object
 7   review_total_negative_word_counts           386496 non-null  int64
 8   total_number_of_reviews                     386496 non-null  int64
 9   positive_review                             386496 non-null  object
 10  review_total_positive_word_counts           386496 non-null  int64
 11  total_number_of_reviews_reviewer_has_given  386496 non-null  int64
 12  reviewer_score                              386496 non-null  float64
 13  tags                                        386496 non-null  object
 14  days_since_review                           386496 non-null  int16
 15  lat                                         386496 non-null  float64
 16  lng                                         386496 non-null  float64
 17  country                                     386496 non-null  object
 18  reviever_mean_score                         386496 non-null  float64
 19  reviever_median_score                       386496 non-null  float64
 20  month                                       386496 non-null  int32
 21  top_10_nation                               386496 non-null  int64
 22  low_10_nation                               386496 non-null  int64
 23  purpose_arrival                             375287 non-null  object
 24  is_business                                 386496 non-null  int64
 25  stayed_nights                               386496 non-null  int32
 26  is_solo_travel                              386496 non-null  bool
 27  is_couple_travel                            386496 non-null  bool
 28  is_family_travel                            386496 non-null  bool
 29  is_group_travel                             386496 non-null  bool
 30  neg_ratio                                   386496 non-null  float64
 31  pos_ratio                                   386496 non-null  float64
dtypes: bool(4), category(3), datetime64[ns](1), float64(8), int16(1), int32(2), int64(8), object(5)
memory usage: 83.4+ MB
train_columns = []
train_num_columns = [
'average_score',
# 'reviever_mean_score',
# 'reviever_median_score',
'review_total_negative_word_counts',
'review_total_positive_word_counts',
'total_number_of_reviews_reviewer_has_given',
'total_number_of_reviews',
]
train_columns.extend(train_num_columns)
hotels_df[train_columns].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 386496 entries, 0 to 386802
Data columns (total 5 columns):
 #   Column                                      Non-Null Count   Dtype
---  ------                                      --------------   -----
 0   average_score                               386496 non-null  float64
 1   review_total_negative_word_counts           386496 non-null  int64
 2   review_total_positive_word_counts           386496 non-null  int64
 3   total_number_of_reviews_reviewer_has_given  386496 non-null  int64
 4   total_number_of_reviews                     386496 non-null  int64
dtypes: float64(1), int64(4)
memory usage: 25.8 MB
train_bin_columns = [
# 'is_business',
# 'is_solo_travel',
# 'is_couple_travel',
# 'is_family_travel',
#'top_10_nation',
#'low_10_nation',
]
train_cat_columns = [
# 'reviewer_nationality',
# 'month',
# 'country',
]
train_columns.extend(train_cat_columns+train_bin_columns)
hotels_df[train_columns].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 386496 entries, 0 to 386802
Data columns (total 5 columns):
 #   Column                                      Non-Null Count   Dtype
---  ------                                      --------------   -----
 0   average_score                               386496 non-null  float64
 1   review_total_negative_word_counts           386496 non-null  int64
 2   review_total_positive_word_counts           386496 non-null  int64
 3   total_number_of_reviews_reviewer_has_given  386496 non-null  int64
 4   total_number_of_reviews                     386496 non-null  int64
dtypes: float64(1), int64(4)
memory usage: 25.8 MB
## There may be nuances here, so this goes in a separate sub-block
train_data = hotels_df[train_columns+[TARGET]].copy()
## Convert all boolean features to numeric [0, 1]
for column in train_data.columns:
    if train_data[column].dtype == 'bool':
        train_data[column] = train_data[column].astype(int)
## Binary Encode
# encode_cols=['reviewer_nationality']
# ce_binary = ce.BinaryEncoder(cols=encode_cols)
# train_data = ce_binary.fit_transform(train_data)
## One Hot Encode
#encode_cols=['month', 'country']
encode_cols=train_cat_columns
ce_onehot = ce.OneHotEncoder(cols=encode_cols, use_cat_names=True)
train_data = ce_onehot.fit_transform(train_data)
# Scale the numeric features
from sklearn.preprocessing import MinMaxScaler
_scaler = MinMaxScaler()
# fit_transform fits the scaler and transforms in one call, so a separate fit() is not needed
train_data[train_num_columns] = _scaler.fit_transform(train_data[train_num_columns])
# from sklearn.preprocessing import StandardScaler
# _scaler = StandardScaler()
# _scaler.fit(train_data)
# train_data[train_columns] = _scaler.fit_transform(train_data)
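The cell above fits `MinMaxScaler` on the full dataset before the train/validation split. A common alternative, shown here as a minimal pure-Python sketch rather than as the notebook's method, is to learn the minimum and maximum on the training part only and reuse them for the validation part. With scikit-learn this corresponds to calling `fit` on the training rows and `transform` on both parts. The helper names `fit_min_max` and `apply_min_max` and the toy values are assumptions made for illustration.

```python
def fit_min_max(values):
    """Learn the scaling range from the training values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Map values to [0, 1] using a previously learned range."""
    span = hi - lo
    if span == 0:
        # A constant feature carries no information; map it to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical word counts, split into training and validation parts
train_counts = [0, 5, 10, 20]
valid_counts = [10, 40]

lo, hi = fit_min_max(train_counts)
train_scaled = apply_min_max(train_counts, lo, hi)
valid_scaled = apply_min_max(valid_counts, lo, hi)

print(train_scaled)
print(valid_scaled)
```

Validation values that lie outside the training range end up outside [0, 1], which is expected with this approach.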
# Correlation method: pearson, spearman, or kendall
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(8, 6))
ax.set_title('Correlation map of numeric features')
matrix_corr = train_data[train_num_columns+[TARGET]].corr(method='pearson')
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='pearson_correlation', step=exp_version[1]);
# Correlation method: pearson, spearman, or kendall
num_as_cat_cols = [
    'average_score',
    # 'reviever_mean_score',
    # 'reviever_median_score',
] + [x for x in train_data.columns if train_data[x].nunique()==2]
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(6, 5))
ax.set_title('Correlation map of categorical features')
matrix_corr = train_data[num_as_cat_cols+[TARGET]].corr(method='spearman').round(2)
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='spearman_correlation', step=exp_version[1]);
# Correlation method: pearson, spearman, or kendall
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(8, 6))
ax.set_title('Correlation map of all model features')
matrix_corr = train_data.corr(method='pearson').round(2)
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='all_predict_corr', step=exp_version[1]);
train_data.head(3)
| | average_score | review_total_negative_word_counts | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | total_number_of_reviews | reviewer_score |
|---|---|---|---|---|---|---|
| 0 | 0.695652 | 0.007353 | 0.010127 | 0.016949 | 0.117339 | 10.0 |
| 1 | 0.673913 | 0.007353 | 0.005063 | 0.036723 | 0.079269 | 6.3 |
| 2 | 0.804348 | 0.014706 | 0.000000 | 0.036723 | 0.021832 | 7.5 |
X = train_data.drop([TARGET], axis=1)
y = train_data[TARGET].values
## Continuous features
num_cols = [x for x in X.columns if X[x].nunique()>2]
if num_cols:
    # f_classif expects class labels, so the continuous target is truncated to integers
    y_ = y.astype('int')
    from sklearn.feature_selection import f_classif # ANOVA
    imp_num = pd.Series(f_classif(X[num_cols], y_)[0], index = num_cols)
    imp_num.sort_values(inplace = True)
    imp_num.plot(kind = 'barh');
# Categorical features (effectively all binary by now)
cat_cols = [x for x in X.columns if X[x].nunique()==2]
if cat_cols:
    y_ = y.astype('int')
    from sklearn.feature_selection import chi2 # chi-squared
    imp_cat = pd.Series(chi2(X[cat_cols], y_)[0], index=cat_cols)
    imp_cat.sort_values(inplace=True)
    imp_cat.plot(kind='barh');
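Both tests above expect class labels, which is why the continuous target is truncated with `y.astype('int')`. As a rough illustration of what the ANOVA F-score measures for one numeric feature, the sketch below computes the ratio of between-class to within-class variance in pure Python. The function name `anova_f` and the toy data are hypothetical; the notebook itself relies on `f_classif`.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic for a single feature grouped by class label."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)

    n = len(values)
    k = len(groups)
    grand_mean = sum(values) / n

    # Between-group sum of squares: how far each class mean is from the grand mean
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    # Within-group sum of squares: spread of values around their own class mean
    ss_within = sum(
        sum((v - sum(vs) / len(vs)) ** 2 for v in vs) for vs in groups.values()
    )

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy feature values and integer-truncated scores used as class labels
feature = [1, 2, 3, 2, 3, 4, 5, 6, 7]
score_class = [7, 7, 7, 8, 8, 8, 9, 9, 9]

f_value = anova_f(feature, score_class)
print(round(f_value, 3))
```

A larger F-score means the feature's mean differs more between score classes than its values vary within each class, which is what the bar chart ranks.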
# Use train_test_split to split the data into training and validation parts
# Hold out 20% of the data for validation (the test_size parameter)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
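Fixing `random_state` makes the split reproducible: as long as the number of rows stays the same, a rerun with the same seed selects the same validation rows. The idea can be sketched in pure Python as follows. This is not how `train_test_split` is implemented, and `split_indices` is a hypothetical helper used only for illustration.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off a validation share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_rows=10, test_size=0.2, seed=4242)
again_train, again_test = split_indices(n_rows=10, test_size=0.2, seed=4242)

print(len(train_idx), len(test_idx))
print(test_idx == again_test)
```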
# Create the model (DO NOT CHANGE THE SETTINGS)
params = dict(
    n_estimators=100,
    verbose=1,
    n_jobs=-1,
    random_state=RANDOM_SEED)
model = RandomForestRegressor(**params)
# Log the model hyperparameters
params['model'] ='RandomForestRegressor'
params['target'] = TARGET
params['features_count'] = len(X.columns)
params['features_name'] = X.columns.to_list()
if COMET_LOG_ENABLE:
    experiment.log_parameters(parameters=params, step=exp_version[1]) #, step=1, prefix='comet_'
#display(params)
%%time
start_learn = time.time()
# Train the model on the training set
model.fit(X_train, y_train)
learning_time = time.time() - start_learn
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 20.3s
CPU times: total: 2min 27s Wall time: 43 s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 42.9s finished
%%time
start_predict = time.time()
# Use the trained model to predict reviewer scores on the validation set
y_pred = model.predict(X_test)
predict_time = time.time() - start_predict
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.6s
CPU times: total: 5.37 s Wall time: 1.58 s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 1.5s finished
# Compare the predicted values (y_pred) with the actual ones (y_test) and see how far apart they are on average.
# The metric is Mean Absolute Percentage Error (MAPE): the mean absolute deviation of the predictions from the actual values, relative to the actual values.
model_metrics = {
    'MAPE': metrics.mean_absolute_percentage_error(y_test, y_pred),
    'features_count': len(X.columns),
    'learning_time': learning_time,
    # 'predict_time': predict_time,
}
# The number that matters most
print('-------------------------')
print('MAPE:', model_metrics['MAPE'])
print('-------------------------')
# MAPE: 0.14155952662243707
# MAPE: 0.13839899541270664
# MAPE: 0.13790204359287050
-------------------------
MAPE: 0.14155952662243704
-------------------------
# def mean_absolute_percentage_error(y_test, y_pred):
# return np.mean(np.abs((y_test - y_pred) / y_test))
# print('MAPE:', mean_absolute_percentage_error(y_test, y_pred))
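The commented-out function above is the plain definition of MAPE. To make the difference from MAE concrete, here is a small pure-Python sketch with made-up scores. The function names mirror the metric names but are local helpers, not the scikit-learn functions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute difference relative to the true value (a fraction, not a percent)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical reviewer scores and predictions
y_true = [10.0, 5.0, 8.0]
y_pred = [9.0, 6.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
print(round(mae, 4), round(mape, 4))
```

The same one-point miss counts twice as much for a true score of 5 as for a true score of 10. The value is a fraction, so a MAPE of 0.1416 corresponds to roughly 14.16%.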
# RandomForestRegressor can report which features matter most to the model
fig, ax = plt.subplots(figsize=(10,10))
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(15).plot(kind='barh',ax=ax)
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='feature_importances', step=exp_version[1]);
# Save the metrics and close the experiment log
if COMET_LOG_ENABLE:
    experiment.log_metrics(model_metrics, step=exp_version[1])
    experiment.end()
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url : https://www.comet.com/dheglsfds/sfds-project-3/86462d90bd474f5eaefafb7fec38c14c
COMET INFO:   Metrics:
COMET INFO:     MAPE : 0.14155952662243704
COMET INFO:     features_count : 5
COMET INFO:     learning_time : 43.04143762588501
COMET INFO:   Others:
COMET INFO:     Name : baseline
COMET INFO:   Parameters:
COMET INFO:     features_count : 5
COMET INFO:     features_name : ['average_score', 'review_total_negative_word_counts', 'review_total_positive_word_counts', 'total_number_of_reviews_reviewer_has_given', 'total_number_of_reviews']
COMET INFO:     model : RandomForestRegressor
COMET INFO:     n_estimators : 100
COMET INFO:     n_jobs : -1
COMET INFO:     random_state : 4242
COMET INFO:     target : reviewer_score
COMET INFO:     verbose : 1
COMET INFO:   Uploads:
COMET INFO:     figures : 4
COMET INFO:
COMET WARNING: Comet has disabled auto-logging functionality as it has been imported after the following ML modules: sklearn. Metrics and hyperparameters can still be logged using Experiment.log_metrics() and Experiment.log_parameters()
COMET INFO: Uploading metrics, params, and assets to Comet before program termination (may take several seconds)
COMET INFO: The Python SDK has 3600 seconds to finish uploading collected data
COMET INFO: Uploading 1 metrics, params and output messages
exp_log['{}.{}'.format(exp_version[0], exp_version[1])] = {'name':exp_name,'params':params, 'metrics': model_metrics}
# Running summary
for v, exp in exp_log.items():
    print(exp['name'], v)
    print(' mape:', round(exp['metrics']['MAPE'], 6))
    print(' feature count:', exp['params']['features_count'])
    print(' learning time:', round(exp['metrics']['learning_time'], 2), 's')
baseline 1.1
 mape: 0.14156
 feature count: 5
 learning time: 43.04 s
exp_version[1] += 1
exp_name = 'exchenge'
if COMET_LOG_ENABLE:
    experiment = comet.Experiment(**EXPERIMENT_PARAMS)
    experiment.set_name(exp_name)
    experiment.add_tag('learning {}.{}'.format(exp_version[0], exp_version[1]))
COMET WARNING: Comet has disabled auto-logging functionality as it has been imported after the following ML modules: sklearn. Metrics and hyperparameters can still be logged using Experiment.log_metrics() and Experiment.log_parameters()
COMET WARNING: As you are running in a Jupyter environment, you will need to call `experiment.end()` when finished to ensure all metrics and code are logged before exiting.
COMET INFO: Couldn't find a Git repository in 'C:\\Python\\learning\\eda.kaggle\\project-3-kaggle\\code' nor in any parent directory. You can override where Comet is looking for a Git Patch by setting the configuration `COMET_GIT_DIRECTORY`
COMET INFO: Experiment is live on comet.com https://www.comet.com/dheglsfds/sfds-project-3/12b7f79a208141798649a391053094a9
train_columns = []
train_num_columns = [
# 'average_score',
'reviever_mean_score',
# 'reviever_median_score',
'review_total_negative_word_counts',
'review_total_positive_word_counts',
'total_number_of_reviews_reviewer_has_given',
'total_number_of_reviews',
#'additional_number_of_scoring',
# 'lat',
# 'lng',
'stayed_nights',
]
train_columns.extend(train_num_columns)
# Correlation method: pearson, spearman, or kendall
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(9, 7))
ax.set_title('Correlation map of numeric features')
matrix_corr = hotels_df[train_num_columns+[TARGET]].corr(method='pearson')
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='pearson_correlation', step=exp_version[1]);
train_bin_columns = [
'is_solo_travel',
'is_couple_travel',
'is_family_travel',
'is_group_travel',
'is_business',
'top_10_nation',
'low_10_nation',
]
train_cat_columns = [
'month',
'country',
# 'reviewer_nationality',
]
train_columns.extend(train_bin_columns+train_cat_columns)
hotels_df[train_columns].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 386496 entries, 0 to 386802
Data columns (total 15 columns):
 #   Column                                      Non-Null Count   Dtype
---  ------                                      --------------   -----
 0   reviever_mean_score                         386496 non-null  float64
 1   review_total_negative_word_counts           386496 non-null  int64
 2   review_total_positive_word_counts           386496 non-null  int64
 3   total_number_of_reviews_reviewer_has_given  386496 non-null  int64
 4   total_number_of_reviews                     386496 non-null  int64
 5   stayed_nights                               386496 non-null  int32
 6   is_solo_travel                              386496 non-null  bool
 7   is_couple_travel                            386496 non-null  bool
 8   is_family_travel                            386496 non-null  bool
 9   is_group_travel                             386496 non-null  bool
 10  is_business                                 386496 non-null  int64
 11  top_10_nation                               386496 non-null  int64
 12  low_10_nation                               386496 non-null  int64
 13  month                                       386496 non-null  int32
 14  country                                     386496 non-null  object
dtypes: bool(4), float64(1), int32(2), int64(7), object(1)
memory usage: 42.0+ MB
# Correlation method: pearson, spearman, or kendall
num_as_cat_cols = [
    # 'average_score',
    'reviever_mean_score',
    # 'reviever_median_score',
]
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(9, 8))
ax.set_title('Correlation map of categorical features')
matrix_corr = hotels_df[train_cat_columns+train_bin_columns+num_as_cat_cols+[TARGET]].corr(method='spearman')
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='spearman_correlation', step=exp_version[1]);
# There may be nuances here, so this goes in a separate sub-block
train_data = hotels_df[train_columns + [TARGET]].copy()
# Convert all boolean features to numeric [0, 1]
for column in train_data.columns:
    if train_data[column].dtype == 'bool':
        train_data[column] = train_data[column].astype(int)
# ## Binary Encode
# encode_cols=['reviewer_nationality']
# ce_binary = ce.BinaryEncoder(cols=encode_cols)
# train_data = ce_binary.fit_transform(train_data)
## One Hot Encode
encode_cols=['month', 'country']
ce_onehot = ce.OneHotEncoder(cols=encode_cols, use_cat_names=True)
train_data = ce_onehot.fit_transform(train_data)
train_data.head(3)
| | reviever_mean_score | review_total_negative_word_counts | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | total_number_of_reviews | stayed_nights | is_solo_travel | is_couple_travel | is_family_travel | is_group_travel | ... | month_7.0 | month_4.0 | month_8.0 | country_United Kingdom | country_France | country_Netherlands | country_Italy | country_Austria | country_Spain | reviewer_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3 | 3 | 4 | 7 | 1994 | 2 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 10.0 |
| 1 | 8.4 | 3 | 2 | 14 | 1361 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6.3 |
| 2 | 9.0 | 6 | 0 | 14 | 406 | 3 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7.5 |
3 rows × 32 columns
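One-hot encoding adds one column per category. That is how 13 numeric and binary features plus `month` (12 values) and `country` (6 values) become the 32 columns reported above (31 features plus the target). Binary encoding, which is commented out in this experiment and applied to `reviewer_nationality` in the `optimal` one, instead writes each category's ordinal number in base 2 and so needs far fewer columns for a high-cardinality feature. The sketch below illustrates both ideas in pure Python. It approximates the concept and is not the `category_encoders` implementation; the helper names and the sample values are hypothetical.

```python
import math


def one_hot(values):
    """One column per distinct category; exactly one 1 per row."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def binary_encode(values):
    """Number the categories from 1 and write each number as fixed-width bits."""
    categories = sorted(set(values))
    ordinal = {c: i + 1 for i, c in enumerate(categories)}
    width = max(1, math.ceil(math.log2(len(categories) + 1)))
    rows = [[int(bit) for bit in format(ordinal[v], f"0{width}b")] for v in values]
    return rows, width


countries = ["France", "Spain", "Italy", "France", "Austria", "Spain"]

onehot_rows, onehot_cats = one_hot(countries)
binary_rows, binary_width = binary_encode(countries)

print(len(onehot_cats), binary_width)
print(onehot_rows[0], binary_rows[0])
```

With four categories the difference is small (4 columns against 3), but it grows quickly: the number of one-hot columns equals the number of categories, while the number of binary columns grows only logarithmically.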
## This turned out to be unnecessary; it did not help
# from sklearn.preprocessing import RobustScaler
# scaled_cols = ['review_total_positive_word_counts', 'review_total_negative_word_counts']
# _scaler = RobustScaler()
# _scaler.fit(train_data[scaled_cols])
# train_data[scaled_cols] = _scaler.fit_transform(train_data[scaled_cols])
from sklearn.preprocessing import MinMaxScaler
_scaler = MinMaxScaler()
# fit_transform fits the scaler and transforms in one call, so a separate fit() is not needed
train_data[train_num_columns] = _scaler.fit_transform(train_data[train_num_columns])
train_data.describe().round(2)
| | reviever_mean_score | review_total_negative_word_counts | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | total_number_of_reviews | stayed_nights | is_solo_travel | is_couple_travel | is_family_travel | is_group_travel | ... | month_7.0 | month_4.0 | month_8.0 | country_United Kingdom | country_France | country_Netherlands | country_Italy | country_Austria | country_Spain | reviewer_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | ... | 386496.00 | 386496.00 | 386496.0 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 |
| mean | 0.72 | 0.05 | 0.05 | 0.02 | 0.16 | 0.08 | 0.21 | 0.49 | 0.17 | 0.13 | ... | 0.10 | 0.08 | 0.1 | 0.51 | 0.12 | 0.11 | 0.07 | 0.08 | 0.12 | 8.40 |
| std | 0.13 | 0.07 | 0.05 | 0.03 | 0.14 | 0.05 | 0.41 | 0.50 | 0.38 | 0.33 | ... | 0.29 | 0.28 | 0.3 | 0.50 | 0.32 | 0.31 | 0.26 | 0.26 | 0.32 | 1.64 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.50 |
| 25% | 0.65 | 0.00 | 0.01 | 0.00 | 0.07 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.50 |
| 50% | 0.74 | 0.02 | 0.03 | 0.01 | 0.13 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 8.80 |
| 75% | 0.80 | 0.06 | 0.06 | 0.02 | 0.22 | 0.10 | 0.00 | 1.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 9.60 |
| max | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ... | 1.00 | 1.00 | 1.0 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 10.00 |
8 rows × 32 columns
# Correlation method: pearson, spearman, or kendall
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(24, 22))
ax.set_title('Correlation map of all model features')
matrix_corr = train_data.corr(method='pearson').round(2)
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='all_predict_corr', step=exp_version[1]);
X = train_data.drop([TARGET], axis=1)
y = train_data[TARGET].values
## Continuous features
num_cols = [x for x in X.columns if X[x].nunique()>2]
if num_cols:
    # f_classif expects class labels, so the continuous target is truncated to integers
    y_ = y.astype('int')
    from sklearn.feature_selection import f_classif # ANOVA
    imp_num = pd.Series(f_classif(X[num_cols], y_)[0], index = num_cols)
    imp_num.sort_values(inplace = True)
    imp_num.plot(kind = 'barh');
# Categorical features
cat_cols = [x for x in X.columns if X[x].nunique()==2]
if cat_cols:
    y_ = y.astype('int')
    from sklearn.feature_selection import chi2 # chi-squared
    imp_cat = pd.Series(chi2(X[cat_cols], y_)[0], index=cat_cols)
    imp_cat.sort_values(inplace=True)
    imp_cat.plot(kind='barh', figsize=(7,6));
# Use train_test_split to split the data into training and validation parts
# Hold out 20% of the data for validation (the test_size parameter)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
# Create the model (DO NOT CHANGE THE SETTINGS)
params = dict(
    n_estimators=100,
    verbose=1,
    n_jobs=-1,
    random_state=RANDOM_SEED)
model = RandomForestRegressor(**params)
# Log the model hyperparameters
params['model'] ='RandomForestRegressor'
params['target'] = TARGET
params['features_count'] = len(X.columns)
params['features_name'] = X.columns.to_list()
if COMET_LOG_ENABLE:
    experiment.log_parameters(parameters=params, step=exp_version[1]) #, step=1, prefix='comet_'
#display(params)
%%time
start_learn = time.time()
# Train the model on the training set
model.fit(X_train, y_train)
learning_time = time.time() - start_learn
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 50.0s
CPU times: total: 7min 3s Wall time: 1min 55s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 1.9min finished
%%time
start_predict = time.time()
# Use the trained model to predict reviewer scores for the hotels in the validation set.
# Store the predicted values in y_pred
y_pred = model.predict(X_test)
predict_time = time.time() - start_predict
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.7s
CPU times: total: 5.83 s Wall time: 1.69 s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 1.6s finished
# Compare the predicted values (y_pred) with the actual ones (y_test) and see how far apart they are on average.
# The metric is Mean Absolute Percentage Error (MAPE): the mean absolute deviation of the predictions from the actual values, relative to the actual values.
model_metrics = {
    'MAPE': metrics.mean_absolute_percentage_error(y_test, y_pred),
    'features_count': len(X.columns),
    'learning_time': learning_time,
    # 'predict_time': predict_time,
}
# The number that matters most
print('-------------------------')
print('MAPE:', model_metrics['MAPE'])
print('-------------------------')
# MAPE: 0.13563398464623694
-------------------------
MAPE: 0.13527112654125675
-------------------------
# RandomForestRegressor can report which features matter most to the model
fig, ax = plt.subplots(figsize=(10,10))
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(15).plot(kind='barh',ax=ax);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='feature_importances', step=exp_version[1]);
# Save the metrics and close the experiment log
if COMET_LOG_ENABLE:
    experiment.log_metrics(model_metrics, step=exp_version[1])
    experiment.end()
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url : https://www.comet.com/dheglsfds/sfds-project-3/12b7f79a208141798649a391053094a9
COMET INFO:   Metrics:
COMET INFO:     MAPE : 0.13527112654125675
COMET INFO:     features_count : 31
COMET INFO:     learning_time : 115.07357382774353
COMET INFO:   Others:
COMET INFO:     Name : exchenge
COMET INFO:   Parameters:
COMET INFO:     features_count : 31
COMET INFO:     features_name : ['reviever_mean_score', 'review_total_negative_word_counts', 'review_total_positive_word_counts', 'total_number_of_reviews_reviewer_has_given', 'total_number_of_reviews', 'stayed_nights', 'is_solo_travel', 'is_couple_travel', 'is_family_travel', 'is_group_travel', 'is_business', 'top_10_nation', 'low_10_nation', 'month_2.0', 'month_1.0', 'month_10.0', 'month_9.0', 'month_3.0', 'month_12.0', 'month_5.0', 'month_11.0', 'month_6.0', 'month_7.0', 'month_4.0', 'month_8.0', 'country_United Kingdom', 'country_France', 'country_Netherlands', 'country_Italy', 'country_Austria', 'country_Spain']
COMET INFO:     model : RandomForestRegressor
COMET INFO:     n_estimators : 100
COMET INFO:     n_jobs : -1
COMET INFO:     random_state : 4242
COMET INFO:     target : reviewer_score
COMET INFO:     verbose : 1
COMET INFO:   Uploads:
COMET INFO:     figures : 4
COMET INFO:
COMET WARNING: Comet has disabled auto-logging functionality as it has been imported after the following ML modules: sklearn. Metrics and hyperparameters can still be logged using Experiment.log_metrics() and Experiment.log_parameters()
COMET INFO: Uploading 1 metrics, params and output messages
exp_log['{}.{}'.format(exp_version[0], exp_version[1])] = {'name':exp_name,'params':params, 'metrics': model_metrics}
# Running summary
for v, exp in exp_log.items():
    print(exp['name'], v)
    print(' mape:', round(exp['metrics']['MAPE'], 6))
    print(' feature count:', exp['params']['features_count'])
    print(' learning time:', round(exp['metrics']['learning_time'], 2), 's')
baseline 1.1
 mape: 0.14156
 feature count: 5
 learning time: 43.04 s
exchenge 1.2
 mape: 0.135271
 feature count: 31
 learning time: 115.07 s
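To put the two runs side by side, the short calculation below expresses the change in MAPE relative to the baseline and the increase in training time. The numbers are copied from the log printed above; the dictionary layout is only an illustration and differs from the notebook's `exp_log` structure.

```python
# Values copied from the experiment log printed above
baseline = {"mape": 0.14155952662243704, "learning_time": 43.04, "features": 5}
exchenge = {"mape": 0.13527112654125675, "learning_time": 115.07, "features": 31}

# Relative reduction in MAPE compared with the baseline
mape_gain = (baseline["mape"] - exchenge["mape"]) / baseline["mape"]
# How many times longer the training took
time_ratio = exchenge["learning_time"] / baseline["learning_time"]

print(f"MAPE reduced by {mape_gain:.1%}, training time x{time_ratio:.2f}")
```

In other words, the larger feature set buys a relative MAPE reduction of a few percent at the cost of more than doubling the training time on this machine.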
exp_version[1] += 1
exp_name = 'optimal'
if COMET_LOG_ENABLE:
    experiment = comet.Experiment(**EXPERIMENT_PARAMS)
    experiment.set_name(exp_name)
    experiment.add_tag('learning {}.{}'.format(exp_version[0], exp_version[1]))
COMET WARNING: Comet has disabled auto-logging functionality as it has been imported after the following ML modules: sklearn. Metrics and hyperparameters can still be logged using Experiment.log_metrics() and Experiment.log_parameters()
COMET WARNING: As you are running in a Jupyter environment, you will need to call `experiment.end()` when finished to ensure all metrics and code are logged before exiting.
COMET INFO: Couldn't find a Git repository in 'C:\\Python\\learning\\eda.kaggle\\project-3-kaggle\\code' nor in any parent directory. You can override where Comet is looking for a Git Patch by setting the configuration `COMET_GIT_DIRECTORY`
COMET INFO: Experiment is live on comet.com https://www.comet.com/dheglsfds/sfds-project-3/1f1ce83817474707a6cb736c7ceb693c
train_columns = []
train_num_columns = [
'average_score',
# 'reviever_mean_score',
# 'reviever_median_score',
'review_total_negative_word_counts',
'review_total_positive_word_counts',
'total_number_of_reviews_reviewer_has_given',
'total_number_of_reviews',
# 'additional_number_of_scoring',
'lat',
'lng',
'stayed_nights',
'neg_ratio',
'pos_ratio',
]
train_columns.extend(train_num_columns)
# Correlation method: pearson, spearman, or kendall
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(9, 7))
ax.set_title('Correlation map of numeric features')
matrix_corr = hotels_df[train_num_columns+[TARGET]].corr(method='pearson')
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='pearson_correlation', step=exp_version[1]);
train_bin_columns = [
'is_business',
'is_solo_travel',
'is_couple_travel',
'is_group_travel',
'is_family_travel',
# 'top_10_nation',
# 'low_10_nation',
]
train_cat_columns = [
'month',
# 'country',
'reviewer_nationality',
]
train_columns.extend(train_cat_columns+train_bin_columns)
hotels_df[train_columns].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 386496 entries, 0 to 386802
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype
---  ------                                      --------------   -----
 0   average_score                               386496 non-null  float64
 1   review_total_negative_word_counts           386496 non-null  int64
 2   review_total_positive_word_counts           386496 non-null  int64
 3   total_number_of_reviews_reviewer_has_given  386496 non-null  int64
 4   total_number_of_reviews                     386496 non-null  int64
 5   lat                                         386496 non-null  float64
 6   lng                                         386496 non-null  float64
 7   stayed_nights                               386496 non-null  int32
 8   neg_ratio                                   386496 non-null  float64
 9   pos_ratio                                   386496 non-null  float64
 10  month                                       386496 non-null  int32
 11  reviewer_nationality                        386496 non-null  category
 12  is_business                                 386496 non-null  int64
 13  is_solo_travel                              386496 non-null  bool
 14  is_couple_travel                            386496 non-null  bool
 15  is_group_travel                             386496 non-null  bool
 16  is_family_travel                            386496 non-null  bool
dtypes: bool(4), category(1), float64(5), int32(2), int64(5)
memory usage: 45.7 MB
#pearson, spearman, kendall
num_as_cat_cols = [
'average_score',
# 'reviever_mean_score',
# 'reviever_median_score',
]
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(9, 8))
ax.set_title('Correlation map of categorical features')
matrix_corr = hotels_df[train_cat_columns+train_bin_columns+num_as_cat_cols+[TARGET]].corr(method='spearman')
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='spearman_correlation', step=exp_version[1]);
# There may be subtleties here, so this step gets its own sub-block
train_data = hotels_df[train_columns + [TARGET]].copy()
# Convert all boolean features to integers (0/1)
for column in train_data.columns:
    if train_data[column].dtype == 'bool':
        train_data[column] = train_data[column].astype(int)
# Binary Encode
encode_cols=['reviewer_nationality']
ce_binary = ce.BinaryEncoder(cols=encode_cols)
train_data = ce_binary.fit_transform(train_data)
# One Hot Encode
#encode_cols=['month', 'country']
encode_cols=['month']
ce_onehot = ce.OneHotEncoder(cols=encode_cols, use_cat_names=True)
train_data = ce_onehot.fit_transform(train_data)
train_data.head(5)
| | average_score | review_total_negative_word_counts | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | total_number_of_reviews | lat | lng | stayed_nights | neg_ratio | pos_ratio | ... | reviewer_nationality_4 | reviewer_nationality_5 | reviewer_nationality_6 | reviewer_nationality_7 | is_business | is_solo_travel | is_couple_travel | is_group_travel | is_family_travel | reviewer_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.4 | 3 | 4 | 7 | 1994 | 51.507894 | -0.143671 | 2 | 0.000 | 0.559 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 10.0 |
| 1 | 8.3 | 3 | 2 | 14 | 1361 | 51.521009 | -0.123097 | 1 | 0.608 | 0.000 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 6.3 |
| 2 | 8.9 | 6 | 0 | 14 | 406 | 48.845377 | 2.325643 | 3 | 0.663 | 0.000 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 7.5 |
| 3 | 7.5 | 0 | 11 | 8 | 607 | 48.888697 | 2.394540 | 1 | 0.000 | 0.767 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 10.0 |
| 4 | 8.5 | 4 | 20 | 10 | 7586 | 52.385601 | 4.847060 | 6 | 0.073 | 0.340 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 9.6 |
5 rows × 36 columns
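The table shows `reviewer_nationality` replaced by bit columns `reviewer_nationality_0` through `reviewer_nationality_7`. The idea behind binary encoding can be sketched without the library: give each category an integer code and write that code in base 2, one column per bit. The function `binary_encode` below is a hypothetical illustration; the real `ce.BinaryEncoder` may assign codes in a different order and handle unknown or missing values differently.

```python
def binary_encode(values):
    """Assign each category an ordinal code (starting at 1) and return its bits.

    This is an illustrative sketch, not the category_encoders implementation.
    """
    codes = {}
    for value in values:
        # First occurrence of a category gets the next free code.
        codes.setdefault(value, len(codes) + 1)
    width = max(codes.values()).bit_length()  # number of bit columns needed
    return [[int(bit) for bit in format(codes[value], f'0{width}b')]
            for value in values]


# Toy input, invented for illustration.
nationalities = ['United Kingdom', 'France', 'Spain', 'France', 'Italy']
for name, bits in zip(nationalities, binary_encode(nationalities)):
    print(name, bits)
```

In this sketch eight bit columns are enough for up to 255 distinct categories, which is why binary encoding stays compact for a high-cardinality feature where one-hot encoding would add one column per nationality.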
from sklearn.preprocessing import MinMaxScaler
_scaler = MinMaxScaler()
# fit_transform fits the scaler and transforms the data in one call, so a separate fit() is not needed
train_data[train_num_columns] = _scaler.fit_transform(train_data[train_num_columns])
train_data.describe().round(2)
| | average_score | review_total_negative_word_counts | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | total_number_of_reviews | lat | lng | stayed_nights | neg_ratio | pos_ratio | ... | reviewer_nationality_4 | reviewer_nationality_5 | reviewer_nationality_6 | reviewer_nationality_7 | is_business | is_solo_travel | is_couple_travel | is_group_travel | is_family_travel | reviewer_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | ... | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 |
| mean | 0.70 | 0.05 | 0.05 | 0.02 | 0.16 | 0.73 | 0.19 | 0.08 | 0.06 | 0.29 | ... | 0.26 | 0.32 | 0.18 | 0.71 | 0.16 | 0.21 | 0.49 | 0.13 | 0.17 | 8.40 |
| std | 0.12 | 0.07 | 0.05 | 0.03 | 0.14 | 0.31 | 0.28 | 0.05 | 0.10 | 0.22 | ... | 0.44 | 0.47 | 0.39 | 0.46 | 0.37 | 0.41 | 0.50 | 0.33 | 0.38 | 1.64 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.50 |
| 25% | 0.63 | 0.00 | 0.01 | 0.00 | 0.07 | 0.62 | 0.01 | 0.03 | 0.00 | 0.12 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.50 |
| 50% | 0.70 | 0.02 | 0.03 | 0.01 | 0.13 | 0.92 | 0.02 | 0.07 | 0.00 | 0.26 | ... | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 8.80 |
| 75% | 0.78 | 0.06 | 0.06 | 0.02 | 0.22 | 0.92 | 0.31 | 0.10 | 0.09 | 0.42 | ... | 1.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 9.60 |
| max | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ... | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 10.00 |
8 rows × 36 columns
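In this notebook the scaler is fitted on the whole dataset before the train/test split, so the minimum and maximum of the hold-out rows leak into the training features. For a tree-based model this rarely changes the result, but a stricter variant learns the range from the training part only and reuses it for the test part. The sketch below shows that variant in plain Python; `fit_min_max` and `apply_min_max` are hypothetical helpers, not scikit-learn functions, and the numbers are invented.

```python
def fit_min_max(column):
    """Learn the scaling range from the training values only."""
    return min(column), max(column)


def apply_min_max(column, lo, hi):
    """Scale values with a previously learned range; constant columns map to 0."""
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in column]


# Toy values, invented for illustration.
train_values = [2.5, 7.5, 10.0]
test_values = [5.0, 12.5]

lo, hi = fit_min_max(train_values)
train_scaled = apply_min_max(train_values, lo, hi)
test_scaled = apply_min_max(test_values, lo, hi)  # may fall outside [0, 1]
print(train_scaled, test_scaled)
```

With scikit-learn the same pattern is `fit` on the training frame followed by `transform` on both frames.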
# from sklearn.preprocessing import StandardScaler
# _scaler = StandardScaler()
# _scaler.fit(train_data)
# train_data[train_columns] = _scaler.fit_transform(train_data)
#pearson, spearman, kendall
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(24, 22))
ax.set_title('Correlation map of all model features')
matrix_corr = train_data.corr(method='pearson').round(2)
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='all_predict_corr', step=exp_version[1]);
X = train_data.drop([TARGET], axis=1)
y = train_data[TARGET].values
# Continuous features
num_cols = [x for x in X.columns if X[x].nunique() > 2]
if num_cols:
    # f_classif expects class labels, so the score is truncated to an integer
    y_ = y.astype('int')
    from sklearn.feature_selection import f_classif  # ANOVA F-test
    imp_num = pd.Series(f_classif(X[num_cols], y_)[0], index=num_cols)
    imp_num.sort_values(inplace=True)
    imp_num.plot(kind='barh');
# Well, well. Let's see what the model has to say.
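The bar chart above ranks features by the ANOVA F-statistic: for each feature, the variance of its class means (between-group) is divided by the variance inside the classes (within-group). A minimal sketch of that statistic for one feature follows. The function `anova_f` and the toy scores are assumptions for illustration; the truncation with `int()` mirrors the `y.astype('int')` step in the cell.

```python
from collections import defaultdict


def anova_f(feature, labels):
    """One-way ANOVA F-statistic of a numeric feature against class labels."""
    groups = defaultdict(list)
    for x, label in zip(feature, labels):
        groups[label].append(x)

    n, k = len(feature), len(groups)
    grand_mean = sum(feature) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}

    # Spread of the class means around the overall mean.
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    # Spread of the values around their own class mean.
    ss_within = sum((x - means[label]) ** 2
                    for label, g in groups.items() for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy data, invented for illustration.
scores = [8.3, 8.8, 8.0, 9.6, 9.2, 9.0]
classes = [int(s) for s in scores]     # truncation: 8, 8, 8, 9, 9, 9
informative = [1, 2, 3, 7, 8, 9]       # differs strongly between classes
uninformative = [1, 2, 3, 1, 2, 3]     # same distribution in both classes
print(anova_f(informative, classes), anova_f(uninformative, classes))
```

A large F means the feature separates the truncated score classes well; it says nothing about non-linear effects, which is one reason the ranking may disagree with the forest's own feature importances.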
# Categorical (binary) features
cat_cols = [x for x in X.columns if X[x].nunique() == 2]
if cat_cols:
    y_ = y.astype('int')
    from sklearn.feature_selection import chi2  # chi-squared test
    imp_cat = pd.Series(chi2(X[cat_cols], y_)[0], index=cat_cols)
    imp_cat.sort_values(inplace=True)
    imp_cat.plot(kind='barh', figsize=(6,5));
# This looks like nonsense
# Use the train_test_split helper to split the data
# and hold out 20% of it for validation (the test_size parameter)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
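What the split does can be sketched in a few lines: shuffle the row indices with a fixed seed and cut off the requested share for validation. The helper `split_indices` below is hypothetical, and it will not reproduce the exact permutation that scikit-learn generates. The default seed 4242 is the `random_state` value that appears in the Comet log, and rounding the test share up is an assumption about how the size is computed.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=4242):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # deterministic for a given seed
    n_test = math.ceil(n_rows * test_size)    # assumed rounding: test share rounded up
    return indices[n_test:], indices[:n_test]


# 386496 is the row count reported by hotels_df.info() above.
train_idx, test_idx = split_indices(386496)
print(len(train_idx), len(test_idx))
```

Fixing the seed is what makes the MAPE values of successive experiments comparable: every run is validated on the same hold-out rows.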
# Create the model (DO NOT CHANGE THE SETTINGS)
params = dict(
    n_estimators=100,
    verbose=1,
    n_jobs=-1,
    random_state=RANDOM_SEED)
model = RandomForestRegressor(**params)
# Log the model hyperparameters
params['model'] = 'RandomForestRegressor'
params['target'] = TARGET
params['features_count'] = len(X.columns)
params['features_name'] = X.columns.to_list()
if COMET_LOG_ENABLE:
    experiment.log_parameters(parameters=params, step=exp_version[1]) #, step=1, prefix='comet_'
#display(params)
%%time
start_learn = time.time()
# Train the model on the training set
model.fit(X_train, y_train)
learning_time = time.time() - start_learn
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.1min
CPU times: total: 9min 33s Wall time: 2min 33s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 2.5min finished
%%time
start_predict = time.time()
# Use the trained model to predict hotel scores for the test set.
# The predicted values are stored in y_pred
y_pred = model.predict(X_test)
predict_time = time.time() - start_predict
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.6s
CPU times: total: 5.99 s Wall time: 1.68 s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 1.6s finished
# Compare the predicted values (y_pred) with the actual ones (y_test) and see how much they differ on average.
# The metric is Mean Absolute Percentage Error (MAPE): the mean absolute deviation of the predictions from the actual values, relative to the actual values.
model_metrics = {
    'MAPE': metrics.mean_absolute_percentage_error(y_test, y_pred),
    'features_count': len(X.columns),
    'learning_time': learning_time,
    # 'predict_time': predict_time,
}
# The metric that matters most
print('--------------------------')
print('MAPE:', model_metrics['MAPE'])
print('--------------------------')
# MAPE: 0.12805043431577218
--------------------------
MAPE: 0.12803328434338665
--------------------------
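MAPE and MAE are easy to confuse, so here is a small sketch of both, written in plain Python. MAE averages the absolute errors in score points, while MAPE divides each error by the true value first, so a MAPE of about 0.128 means the prediction is off by roughly 12.8% of the true score on average. The functions `mae` and `mape` are illustrative; scikit-learn's version additionally guards against division by zero, which does not matter here because the minimum score in `describe()` is 2.5.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, invented for illustration.
actual = [10.0, 5.0, 8.0]
predicted = [9.0, 6.0, 8.0]
print('MAE :', round(mae(actual, predicted), 4))
print('MAPE:', round(mape(actual, predicted), 4))
```

The same one-point miss costs 10% on a score of 10.0 but 20% on a score of 5.0, so MAPE penalizes errors on low scores more heavily than MAE does.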
# RandomForestRegressor can report which features matter most to the model
fig, ax = plt.subplots(figsize=(10,10))
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(15).plot(kind='barh',ax=ax);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='feature_importances');
# Save the metrics and close the experiment log
if COMET_LOG_ENABLE:
    experiment.log_metrics(model_metrics, step=exp_version[1])
    experiment.end()
COMET INFO: --------------------------------------------------------------------------------------- COMET INFO: Comet.ml Experiment Summary COMET INFO: --------------------------------------------------------------------------------------- COMET INFO: Data: COMET INFO: display_summary_level : 1 COMET INFO: url : https://www.comet.com/dheglsfds/sfds-project-3/1f1ce83817474707a6cb736c7ceb693c COMET INFO: Metrics: COMET INFO: MAPE : 0.12803328434338665 COMET INFO: features_count : 35 COMET INFO: learning_time : 153.14375925064087 COMET INFO: Others: COMET INFO: Name : optimal COMET INFO: Parameters: COMET INFO: features_count : 35 COMET INFO: features_name : ['average_score', 'review_total_negative_word_counts', 'review_total_positive_word_counts', 'total_number_of_reviews_reviewer_has_given', 'total_number_of_reviews', 'lat', 'lng', 'stayed_nights', 'neg_ratio', 'pos_ratio', 'month_2.0', 'month_1.0', 'month_10.0', 'month_9.0', 'month_3.0', 'month_12.0', 'month_5.0', 'month_11.0', 'month_6.0', 'month_7.0', 'month_4.0', 'month_8.0', 'reviewer_nationality_0', 'reviewer_nationality_1', 'reviewer_nationality_2', 'reviewer_nationality_3', 'reviewer_nationality_4', 'reviewer_nationality_5', 'reviewer_nationality_6', 'reviewer_nationality_7', 'is_business', 'is_solo_travel', 'is_couple_travel', 'is_group_travel', 'is_family_travel'] COMET INFO: model : RandomForestRegressor COMET INFO: n_estimators : 100 COMET INFO: n_jobs : -1 COMET INFO: random_state : 4242 COMET INFO: target : reviewer_score COMET INFO: verbose : 1 COMET INFO: Uploads: COMET INFO: figures : 4 COMET INFO: COMET WARNING: Comet has disabled auto-logging functionality as it has been imported after the following ML modules: sklearn. Metrics and hyperparameters can still be logged using Experiment.log_metrics() and Experiment.log_parameters() COMET INFO: Uploading 1 metrics, params and output messages
exp_log['{}.{}'.format(exp_version[0], exp_version[1])] = {'name':exp_name,'params':params, 'metrics': model_metrics}
# Running summary
for v, exp in exp_log.items():
    print(exp['name'], v)
    print(' mape:', round(exp['metrics']['MAPE'], 6))
    print(' feature count:', exp['params']['features_count'])
    print(' learning time:', round(exp['metrics']['learning_time'], 2), 's')
baseline 1.1
 mape: 0.14156
 feature count: 5
 learning time: 43.04 s
exchenge 1.2
 mape: 0.135271
 feature count: 31
 learning time: 115.07 s
optimal 1.3
 mape: 0.128033
 feature count: 35
 learning time: 153.14 s
exp_version[1] += 1
exp_name = 'extrime'
if COMET_LOG_ENABLE:
    experiment = comet.Experiment(**EXPERIMENT_PARAMS)
    experiment.set_name(exp_name)
    experiment.add_tag('learning {}.{}'.format(exp_version[0], exp_version[1]))
COMET WARNING: Comet has disabled auto-logging functionality as it has been imported after the following ML modules: sklearn. Metrics and hyperparameters can still be logged using Experiment.log_metrics() and Experiment.log_parameters() COMET WARNING: As you are running in a Jupyter environment, you will need to call `experiment.end()` when finished to ensure all metrics and code are logged before exiting. COMET INFO: Couldn't find a Git repository in 'C:\\Python\\learning\\eda.kaggle\\project-3-kaggle\\code' nor in any parent directory. You can override where Comet is looking for a Git Patch by setting the configuration `COMET_GIT_DIRECTORY` COMET INFO: Experiment is live on comet.com https://www.comet.com/dheglsfds/sfds-project-3/5347811e0fa9477f9680a400d64210c0
train_columns = []
train_num_columns = [
# 'average_score',
'reviever_mean_score',
# 'reviever_median_score',
'review_total_negative_word_counts',
'review_total_positive_word_counts',
'total_number_of_reviews_reviewer_has_given',
'total_number_of_reviews',
#'additional_number_of_scoring',
'lat',
'lng',
'stayed_nights',
'neg_ratio',
'pos_ratio',
#'days_since_review',
]
train_columns.extend(train_num_columns)
# Correlation methods: pearson, spearman, kendall
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(9, 7))
ax.set_title('Correlation map of numeric features')
matrix_corr = hotels_df[train_columns+[TARGET]].corr(method='pearson')
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='pearson_correlation', step=exp_version[1]);
train_bin_columns = [
'is_solo_travel',
'is_couple_travel',
'is_group_travel',
'is_family_travel',
'is_business',
'low_10_nation',
'top_10_nation',
# 'up_score_leader',
# 'down_score_leader'
]
train_cat_columns = [
'month',
'country',
'reviewer_nationality',
]
train_columns.extend(train_cat_columns+train_bin_columns)
hotels_df[train_columns + [TARGET]].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 386496 entries, 0 to 386802
Data columns (total 21 columns):
 #   Column                                      Non-Null Count   Dtype
---  ------                                      --------------   -----
 0   reviever_mean_score                         386496 non-null  float64
 1   review_total_negative_word_counts           386496 non-null  int64
 2   review_total_positive_word_counts           386496 non-null  int64
 3   total_number_of_reviews_reviewer_has_given  386496 non-null  int64
 4   total_number_of_reviews                     386496 non-null  int64
 5   lat                                         386496 non-null  float64
 6   lng                                         386496 non-null  float64
 7   stayed_nights                               386496 non-null  int32
 8   neg_ratio                                   386496 non-null  float64
 9   pos_ratio                                   386496 non-null  float64
 10  month                                       386496 non-null  int32
 11  country                                     386496 non-null  object
 12  reviewer_nationality                        386496 non-null  category
 13  is_solo_travel                              386496 non-null  bool
 14  is_couple_travel                            386496 non-null  bool
 15  is_group_travel                             386496 non-null  bool
 16  is_family_travel                            386496 non-null  bool
 17  is_business                                 386496 non-null  int64
 18  low_10_nation                               386496 non-null  int64
 19  top_10_nation                               386496 non-null  int64
 20  reviewer_score                              386496 non-null  float64
dtypes: bool(4), category(1), float64(6), int32(2), int64(7), object(1)
memory usage: 57.5+ MB
#pearson, spearman, kendall
num_as_cat_cols = [
# 'average_score',
'reviever_mean_score',
# 'reviever_median_score',
]
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(9, 8))
ax.set_title('Correlation map of categorical features')
matrix_corr = hotels_df[train_cat_columns+train_bin_columns+num_as_cat_cols+[TARGET]].corr(method='spearman')
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='spearman_correlation', step=exp_version[1]);
# There may be subtleties here, so this step gets its own sub-block
train_data = hotels_df[train_columns + [TARGET]].copy()
train_data.head(2)
| | reviever_mean_score | review_total_negative_word_counts | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | total_number_of_reviews | lat | lng | stayed_nights | neg_ratio | pos_ratio | ... | country | reviewer_nationality | is_solo_travel | is_couple_travel | is_group_travel | is_family_travel | is_business | low_10_nation | top_10_nation | reviewer_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3 | 3 | 4 | 7 | 1994 | 51.507894 | -0.143671 | 2 | 0.000 | 0.559 | ... | United Kingdom | United Kingdom | False | True | False | False | 0 | 0 | 1 | 10.0 |
| 1 | 8.4 | 3 | 2 | 14 | 1361 | 51.521009 | -0.123097 | 1 | 0.608 | 0.000 | ... | United Kingdom | United Kingdom | False | True | False | False | 1 | 0 | 1 | 6.3 |
2 rows × 21 columns
# Convert all boolean features to integers (0/1)
for column in train_data.columns:
    if train_data[column].dtype == 'bool':
        train_data[column] = train_data[column].astype(int)
# Binary Encode
encode_cols=['reviewer_nationality']
ce_binary = ce.BinaryEncoder(cols=encode_cols)
train_data = ce_binary.fit_transform(train_data)
# One Hot Encode
encode_cols=['month', 'country']
#encode_cols=['month']
ce_onehot = ce.OneHotEncoder(cols=encode_cols, use_cat_names=True)
train_data = ce_onehot.fit_transform(train_data)
train_data.head(5)
| | reviever_mean_score | review_total_negative_word_counts | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | total_number_of_reviews | lat | lng | stayed_nights | neg_ratio | pos_ratio | ... | reviewer_nationality_6 | reviewer_nationality_7 | is_solo_travel | is_couple_travel | is_group_travel | is_family_travel | is_business | low_10_nation | top_10_nation | reviewer_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3 | 3 | 4 | 7 | 1994 | 51.507894 | -0.143671 | 2 | 0.000 | 0.559 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 10.0 |
| 1 | 8.4 | 3 | 2 | 14 | 1361 | 51.521009 | -0.123097 | 1 | 0.608 | 0.000 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 6.3 |
| 2 | 9.0 | 6 | 0 | 14 | 406 | 48.845377 | 2.325643 | 3 | 0.663 | 0.000 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 7.5 |
| 3 | 6.9 | 0 | 11 | 8 | 607 | 48.888697 | 2.394540 | 1 | 0.000 | 0.767 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 10.0 |
| 4 | 8.5 | 4 | 20 | 10 | 7586 | 52.385601 | 4.847060 | 6 | 0.073 | 0.340 | ... | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 9.6 |
5 rows × 44 columns
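One-hot encoding of `month` and `country` is what pushes the frame from 21 to 44 columns: each distinct value becomes its own 0/1 column. The sketch below shows the mechanics in plain Python. The function `one_hot` is a hypothetical stand-in for `ce.OneHotEncoder(use_cat_names=True)`, and naming columns in order of first appearance is an assumption, though it is consistent with the order of the logged feature names (`country_United Kingdom`, `country_France`, ...).

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 column per distinct value."""
    categories = list(dict.fromkeys(values))   # distinct values, first-seen order
    names = [f'{prefix}_{category}' for category in categories]
    rows = [[int(value == category) for category in categories]
            for value in values]
    return names, rows


# Toy input, invented for illustration.
names, rows = one_hot(['United Kingdom', 'France', 'United Kingdom', 'Spain'],
                      prefix='country')
print(names)
for row in rows:
    print(row)
```

The column count of this experiment adds up the same way: 10 numeric features, 12 month columns, 6 country columns, 8 nationality bits and 7 binary flags give the 43 features reported in the log, plus the target.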
from sklearn.preprocessing import MinMaxScaler
_scaler = MinMaxScaler()
# fit_transform fits the scaler and transforms the data in one call, so a separate fit() is not needed
train_data[train_num_columns] = _scaler.fit_transform(train_data[train_num_columns])
train_data.describe().round(2)
| | reviever_mean_score | review_total_negative_word_counts | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | total_number_of_reviews | lat | lng | stayed_nights | neg_ratio | pos_ratio | ... | reviewer_nationality_6 | reviewer_nationality_7 | is_solo_travel | is_couple_travel | is_group_travel | is_family_travel | is_business | low_10_nation | top_10_nation | reviewer_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | ... | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 | 386496.00 |
| mean | 0.72 | 0.05 | 0.05 | 0.02 | 0.16 | 0.73 | 0.19 | 0.08 | 0.06 | 0.29 | ... | 0.18 | 0.71 | 0.21 | 0.49 | 0.13 | 0.17 | 0.16 | 0.08 | 0.66 | 8.40 |
| std | 0.13 | 0.07 | 0.05 | 0.03 | 0.14 | 0.31 | 0.28 | 0.05 | 0.10 | 0.22 | ... | 0.39 | 0.46 | 0.41 | 0.50 | 0.33 | 0.38 | 0.37 | 0.28 | 0.47 | 1.64 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.50 |
| 25% | 0.65 | 0.00 | 0.01 | 0.00 | 0.07 | 0.62 | 0.01 | 0.03 | 0.00 | 0.12 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.50 |
| 50% | 0.74 | 0.02 | 0.03 | 0.01 | 0.13 | 0.92 | 0.02 | 0.07 | 0.00 | 0.26 | ... | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 8.80 |
| 75% | 0.80 | 0.06 | 0.06 | 0.02 | 0.22 | 0.92 | 0.31 | 0.10 | 0.09 | 0.42 | ... | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 9.60 |
| max | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ... | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 10.00 |
8 rows × 44 columns
# from sklearn.preprocessing import StandardScaler
# _scaler = StandardScaler()
# _scaler.fit(train_data)
# train_data[train_columns] = _scaler.fit_transform(train_data)
#pearson, spearman, kendall
heatmap_params = dict(annot=True, vmin=-1, vmax=1, center=0, linewidths=0.1, cmap='coolwarm')
fig, ax = plt.subplots(figsize=(24, 22))
ax.set_title('Correlation map of all model features')
matrix_corr = train_data.corr(method='pearson').round(2)
sns.heatmap(matrix_corr, ax=ax, **heatmap_params);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='all_predict_corr', step=exp_version[1]);
X = train_data.drop([TARGET], axis=1)
y = train_data[TARGET].values
# Continuous features
num_cols = [x for x in X.columns if X[x].nunique() > 2]
if num_cols:
    # f_classif expects class labels, so the score is truncated to an integer
    y_ = y.astype('int')
    from sklearn.feature_selection import f_classif  # ANOVA F-test
    imp_num = pd.Series(f_classif(X[num_cols], y_)[0], index=num_cols)
    imp_num.sort_values(inplace=True)
    imp_num.plot(kind='barh');
# Well, well. Let's see what the model has to say.
# Categorical (binary) features
cat_cols = [x for x in X.columns if X[x].nunique() == 2]
if cat_cols:
    y_ = y.astype('int')
    from sklearn.feature_selection import chi2  # chi-squared test
    imp_cat = pd.Series(chi2(X[cat_cols], y_)[0], index=cat_cols)
    imp_cat.sort_values(inplace=True)
    imp_cat.plot(kind='barh', figsize=(7,6));
# This looks like nonsense
# Use the train_test_split helper to split the data
# and hold out 20% of it for validation (the test_size parameter)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)
# Create the model (DO NOT CHANGE THE SETTINGS)
params = dict(
    n_estimators=100,
    verbose=1,
    n_jobs=-1,
    random_state=RANDOM_SEED)
model = RandomForestRegressor(**params)
# Log the model hyperparameters
params['model'] = 'RandomForestRegressor'
params['target'] = TARGET
params['features_count'] = len(X.columns)
params['features_name'] = X.columns.to_list()
if COMET_LOG_ENABLE:
    experiment.log_parameters(parameters=params, step=exp_version[1]) #, step=1, prefix='comet_'
#display(params)
%%time
start_learn = time.time()
# Train the model on the training set
model.fit(X_train, y_train)
learning_time = time.time() - start_learn
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.2min
CPU times: total: 10min 57s Wall time: 2min 56s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 2.9min finished
%%time
start_predict = time.time()
# Use the trained model to predict hotel scores for the test set.
# The predicted values are stored in y_pred
y_pred = model.predict(X_test)
predict_time = time.time() - start_predict
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.8s
CPU times: total: 6.27 s Wall time: 1.94 s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 1.8s finished
# Compare the predicted values (y_pred) with the actual ones (y_test) and see how much they differ on average.
# The metric is Mean Absolute Percentage Error (MAPE): the mean absolute deviation of the predictions from the actual values, relative to the actual values.
model_metrics = {
    'MAPE': metrics.mean_absolute_percentage_error(y_test, y_pred),
    'features_count': len(X.columns),
    'learning_time': learning_time,
    # 'predict_time': predict_time,
}
# The metric that matters most
print('--------------------------')
print('MAPE:', model_metrics['MAPE'])
print('--------------------------')
# MAPE: 0.12805043431577218
# MAPE: 0.12801268826507775
# MAPE: 0.12760662118105054
--------------------------
MAPE: 0.1276965122287925
--------------------------
# RandomForestRegressor can report which features matter most to the model,
# so let's plot them
fig, ax = plt.subplots(figsize=(10,10))
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(15).plot(kind='barh',ax=ax);
if COMET_LOG_ENABLE:
    experiment.log_figure(figure_name='feature_importances');
# Save the metrics and close the experiment log
if COMET_LOG_ENABLE:
    experiment.log_metrics(model_metrics, step=exp_version[1])
    experiment.end()
COMET INFO: --------------------------------------------------------------------------------------- COMET INFO: Comet.ml Experiment Summary COMET INFO: --------------------------------------------------------------------------------------- COMET INFO: Data: COMET INFO: display_summary_level : 1 COMET INFO: url : https://www.comet.com/dheglsfds/sfds-project-3/5347811e0fa9477f9680a400d64210c0 COMET INFO: Metrics: COMET INFO: MAPE : 0.1276965122287925 COMET INFO: features_count : 43 COMET INFO: learning_time : 176.54359412193298 COMET INFO: Others: COMET INFO: Name : extrime COMET INFO: Parameters: COMET INFO: features_count : 43 COMET INFO: features_name : ['reviever_mean_score', 'review_total_negative_word_counts', 'review_total_positive_word_counts', 'total_number_of_reviews_reviewer_has_given', 'total_number_of_reviews', 'lat', 'lng', 'stayed_nights', 'neg_ratio', 'pos_ratio', 'month_2.0', 'month_1.0', 'month_10.0', 'month_9.0', 'month_3.0', 'month_12.0', 'month_5.0', 'month_11.0', 'month_6.0', 'month_7.0', 'month_4.0', 'month_8.0', 'country_United Kingdom', 'country_France', 'country_Netherlands', 'country_Italy', 'country_Austria', 'country_Spain', 'reviewer_nationality_0', 'reviewer_nationality_1', 'reviewer_nationality_2', 'reviewer_nationality_3', 'reviewer_nationality_4', 'reviewer_nationality_5', 'reviewer_nationality_6', 'reviewer_nationality_7', 'is_solo_travel', 'is_couple_travel', 'is_group_travel', 'is_family_travel', 'is_business', 'low_10_nation', 'top_10_nation'] COMET INFO: model : RandomForestRegressor COMET INFO: n_estimators : 100 COMET INFO: n_jobs : -1 COMET INFO: random_state : 4242 COMET INFO: target : reviewer_score COMET INFO: verbose : 1 COMET INFO: Uploads: COMET INFO: figures : 4 COMET INFO: COMET WARNING: Comet has disabled auto-logging functionality as it has been imported after the following ML modules: sklearn. 
Metrics and hyperparameters can still be logged using Experiment.log_metrics() and Experiment.log_parameters() COMET INFO: Uploading 1 metrics, params and output messages
exp_log['{}.{}'.format(exp_version[0], exp_version[1])] = {'name':exp_name,'params':params, 'metrics': model_metrics}
# Running summary
for v, exp in exp_log.items():
    print(exp['name'], v)
    print(' mape:', round(exp['metrics']['MAPE'], 6))
    print(' feature count:', exp['params']['features_count'])
    print(' learning time:', round(exp['metrics']['learning_time'], 2), 's')
baseline 1.1
 mape: 0.14156
 feature count: 5
 learning time: 43.04 s
exchenge 1.2
 mape: 0.135271
 feature count: 31
 learning time: 115.07 s
optimal 1.3
 mape: 0.128033
 feature count: 35
 learning time: 153.14 s
extrime 1.4
 mape: 0.127697
 feature count: 43
 learning time: 176.54 s
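The summary is easier to judge when each run is compared with the one before it: how much did MAPE drop in relative terms, and how much extra training time did that cost? The sketch below does this for the four runs listed above. The figures are copied from that output; the `runs` list and the `relative_gains` helper are hypothetical and not part of the notebook's `exp_log` structure.

```python
# (name, MAPE, feature count, learning time in seconds), copied from the summary above
runs = [
    ('baseline', 0.14156, 5, 43.04),
    ('exchenge', 0.135271, 31, 115.07),
    ('optimal', 0.128033, 35, 153.14),
    ('extrime', 0.127697, 43, 176.54),
]


def relative_gains(runs):
    """For each run after the first: (name, MAPE reduction in %, extra seconds)."""
    gains = []
    for previous, current in zip(runs, runs[1:]):
        mape_drop = (previous[1] - current[1]) / previous[1] * 100
        extra_time = current[3] - previous[3]
        gains.append((current[0], mape_drop, extra_time))
    return gains


for name, mape_drop, extra_time in relative_gains(runs):
    print(f'{name}: MAPE -{mape_drop:.2f}%, training time +{extra_time:.2f} s')
```

Read this way, the last step from `optimal` to `extrime` buys only about a quarter of a percent of relative improvement for eight more features and roughly 23 more seconds of training, which suggests the feature set is close to saturation for this model configuration.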
# Select the experiments, starting from the bottom one, and click the Diff button in the lower part of the frame
if COMET_LOG_ENABLE:
    experiment.display_project()
print('END')
END
Although the dataset is far from production-grade, this was an entertaining adventure. I hope the ML module of the course will explain what exactly I was doing here.